Senior Infrastructure Engineer (Ceph Storage)

Revolut

London
Permanent
Remote
Ceph Storage, Infrastructure Automation (Ansible/Terraform), Kubernetes

Role Overview

We are seeking an experienced Senior Infrastructure Engineer to join our platform engineering team. The role covers the design, implementation, and maintenance of our large-scale distributed storage infrastructure, with a primary emphasis on Ceph. You will be responsible for the reliability, performance, and scalability of the storage that all our services depend on. The ideal candidate has a deep understanding of Ceph architecture and a proven track record of managing petabyte-scale clusters in production.

Key Responsibilities

  • Design, deploy, and manage highly available, scalable, and performant Ceph storage clusters across multiple data centers.
  • Develop and maintain automation for cluster provisioning, monitoring, and lifecycle management using tools like Ansible, Puppet, or SaltStack.
  • Act as the subject matter expert for all storage-related issues, providing advanced troubleshooting, performance tuning, and root cause analysis.
  • Collaborate with software engineering and SRE teams to define storage requirements, establish best practices, and integrate storage solutions into our CI/CD pipelines.
  • Lead capacity planning, define disaster recovery strategies, and plan and execute major version upgrades across the Ceph ecosystem.

Required Skills & Qualifications

  • 6+ years of hands-on experience managing large-scale Ceph clusters (1 PB+) in a 24/7 production environment.
  • Expert-level knowledge of Ceph architecture, including RADOS, RGW, and RBD.
  • Strong proficiency in Linux/Unix administration and scripting (e.g., Python, Bash).
  • Experience with infrastructure-as-code and automation tools (e.g., Ansible, Terraform, Puppet).

Nice-to-Have Qualifications

  • Experience with containerization and orchestration technologies (Docker, Kubernetes).
  • Familiarity with networking concepts (BGP, LACP) in the context of distributed systems.
  • Contributions to open-source projects, particularly Ceph.