This is a hybrid role, with 2 days a week in the office in one of: Blackpool, Manchester, Leeds, Sheffield or Newcastle.
Our Government client is currently looking for an SC-cleared Site Reliability Engineer for a 6-month contract at £700pd inside IR35.
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SREs work with the application, infrastructure and database developers to increase reliability, reduce downtime and meet uptime targets. They ensure that the client's services (both internally critical and externally visible systems) have reliability and uptime that meet business and citizens' needs. They are also responsible for continuous improvement, while maintaining application capacity and performance. They are involved from the initial application idea all the way through the application lifecycle to decommissioning.
Main skills and tasks:
• Responsible for contributing authoritative advice and guidance to others in the organisation and externally.
• Develop your technical skills; you must have knowledge of the following:
- Terraform, Ansible, Python, Bash, GitLab CI/CD, AWS or Azure managed services, Monitoring services (CloudWatch/Prometheus/Azure Monitor), Containers
• Design and develop the techniques for improving application reliability, run books, knowledge transfer to the UXCC, and ongoing SRE strategy within your Functional and Professional Communities
• Act as the focal point for the investigation and resolution of major or complex incidents for the service, ensuring people with the right skills and expertise are proactively available to respond effectively
• Assess the impact of change requests in consultation with stakeholders, providing technical expertise and authorising the implementation of subsequent changes
• Be on-call for applications that require out-of-hours SRE coverage
• Undertake comprehensive analysis of performance trends to identify root causes, progressing opportunities to improve the reliability, security and capability of infrastructure, application and site services
• Actively engage with senior stakeholders and provide clear communication of incident resolution and service improvements.
• Assure critical changes to the applications and supporting infrastructure
• Conduct code assessments, with a view to correcting errors and providing recommendations for reliability improvements
• Manage the team backlog for the applications for which you are accountable
• Coach and mentor application development and operations engineers in the practice and techniques of SRE
They are looking for someone who:
1. Has a deep understanding of the technical concepts required in their role and understands how these fit into the wider technical landscape. Understands the limitations of digital technology.
2. Leads the assessment, analysis, planning and design of release packages, including assessment of risk. Liaises with business and IT partners on release scheduling and communication of progress. Conducts post-release reviews. Ensures release processes and procedures are applied and that releases can be rolled back as needed. Identifies, evaluates and manages the adoption of appropriate release and deployment tools, techniques and processes (including automation).
3. Collaborates with others when necessary to review specifications, and uses agreed standards and tools to design, code, test, correct and document programs or scripts of medium to high complexity.
• Understanding of security engineering and security best practice
• Ability to architect and administer scalable, cloud-native and on-premises applications
• Strong time management and change management skills
• Strong communication skills across multiple stakeholder types
• Proven ability to modify and maintain systems and code developed by other engineers
• The ability to lead engineers in a complex, multi-disciplinary environment, delivering products within specific timescales
• Desirable – Kubernetes, System administration (RHEL, Windows Server, etc), Network configuration (DNS, routing, load balancing, etc)