Site Reliability Engineer

Southampton HQ - 2 Days a Week in the Office

Cloud, SaaS, AWS

Please be advised that security clearance is required for this position.

We are working with one of our longstanding clients to help them recruit a Site Reliability Engineer. The company deliver cutting-edge enterprise software solutions across both cloud and on-premises environments, empowering organisations to enhance customer experiences, maintain regulatory compliance, and proactively fight fraud. The company are trusted by businesses worldwide to drive seamless, intelligent customer interactions.

In this role, you'll oversee the production environment by ensuring system availability and maintaining a comprehensive perspective on overall health. You'll develop tools and software to support and streamline the management of platform infrastructure and key applications. A major focus will be enhancing the dependability, performance, and delivery speed of the company's software products. You'll also be responsible for analysing and fine-tuning system performance to anticipate user demands and drive innovation. Additionally, you'll take the lead in providing operational support and technical oversight for several large-scale distributed applications.

How You'll Contribute:

Monitor and interpret system and application metrics to fine-tune performance and troubleshoot issues effectively
Collaborate closely with developers to enhance service quality through thorough testing and structured release practices
Engage in architectural discussions, manage platform operations, and contribute to capacity forecasting
Design and implement automated solutions to build resilient, scalable systems
Maintain a strong focus on delivering new features while ensuring stability and adherence to service level goals

You'll Stand Out If You Have:

Practical experience managing large-scale Kubernetes clusters; certifications in Kubernetes are a strong bonus
Hands-on familiarity with the Grafana Observability Suite, including tools like Loki, Mimir, and Tempo
Background in administering or developing with popular monitoring and automation tools such as Splunk, Datadog, PagerDuty, or Rundeck
Experience using configuration management platforms like Ansible, Puppet, or Chef
Professional certifications in cloud DevOps, such as AWS Certified DevOps Engineer or Google Cloud Professional DevOps Engineer, or similar credentials

Do You Have What It Takes?

3-6 years of hands-on experience in a similar role, with a strong emphasis on systems engineering, automation, and service reliability
Proficient in at least one programming language such as Python, Go, Java, or C#, along with scripting skills in Bash or PowerShell
Solid grasp of cloud platforms like AWS, including an understanding of how core services like EC2, ECS, Lambda, and DynamoDB operate under reliability constraints
Practical experience using infrastructure-as-code tools like CloudFormation or Terraform
In-depth knowledge of CI/CD principles and hands-on experience with tools such as Jenkins, GitLab CI/CD, or CircleCI
Strong understanding of containerisation (e.g., Docker, Kubernetes) and microservices architecture
Skilled in using observability and monitoring tools such as Prometheus, Grafana, ELK stack, or AWS CloudWatch
Excellent analytical and troubleshooting abilities, especially within complex distributed systems
Proven experience handling incident management and conducting blameless postmortems, including leading cross-functional teams through resolution and communication during critical outages

Benefits

Life Insurance - 4 x Annual Salary
Private Medical Insurance
Employee Assistance Programme
Hybrid Working - 3 Days from Home
GP Online Assistance Portal
+ Much More

Please click the "Apply" button to state your interest in this position.

Spectrum IT Recruitment (South) Limited is acting as an Employment Agency in relation to this vacancy.
