About The Role
We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our client's dynamic, fast-paced team. In this role, you will be instrumental in designing, building, and maintaining the scalable, reliable, and high-performance infrastructure that powers our services. You will work at the intersection of software engineering and infrastructure operations, applying a software engineering mindset to operational problems. With your start-up background, you'll thrive in our agile environment and play a key part in shaping our platform's future.
Key Responsibilities
- Design, build, and maintain our core infrastructure on AWS and GCP using Infrastructure as Code (IaC) principles with Terraform.
- Develop and manage our Kubernetes clusters, focusing on automation, observability, and scalability to support our microservices architecture.
- Write and maintain software in Golang to automate operational tasks, improve system reliability, and build tooling for the engineering team.
- Participate in an on-call rotation to respond to production incidents, and lead blameless post-mortems to drive continuous improvement.
- Champion SRE best practices across the engineering organisation, including SLOs/SLIs, error budgets, and proactive monitoring and alerting.
Required Skills & Experience
- Proven experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role.
- Strong proficiency in Golang for automation and building internal tools.
- Extensive hands-on experience with Kubernetes for container orchestration in a production environment.
- Demonstrable experience managing cloud infrastructure (AWS and/or GCP) using Terraform.
- Previous experience working within a start-up or a similarly fast-paced, agile environment.
- Solid understanding of CI/CD pipelines, monitoring, and observability principles.
Nice-to-Have
- Experience with other programming languages such as Python or Rust.
- Familiarity with service mesh technologies like Istio or Linkerd.
- Certifications in AWS, GCP, or Kubernetes (e.g., CKA, CKAD).