Senior Infrastructure Engineer (Ceph Storage) at Revolut

Role Overview

We are seeking a highly experienced Senior Infrastructure Engineer to join our platform engineering team. This role focuses on the design, implementation, and maintenance of our large-scale distributed storage infrastructure, with a primary emphasis on Ceph. You will play a central part in ensuring the reliability, performance, and scalability of our storage solutions, which are critical to all our services. The ideal candidate has a deep understanding of Ceph architecture and a proven track record of managing petabyte-scale clusters in production.

Key Responsibilities

Design, deploy, and manage highly available, scalable, and performant Ceph storage clusters across multiple data centers.
Develop and maintain automation for cluster provisioning, monitoring, and lifecycle management using tools like Ansible, Puppet, or SaltStack.
Act as the subject matter expert for all storage-related issues, providing advanced troubleshooting, performance tuning, and root cause analysis.
Collaborate with software engineering and SRE teams to define storage requirements, establish best practices, and integrate storage solutions into our CI/CD pipelines.
Lead capacity planning, and plan and execute disaster recovery strategies and major version upgrades for the Ceph ecosystem.

Required Skills & Qualifications

6+ years of hands-on experience managing large-scale Ceph clusters (1 PB+) in a 24/7 production environment.
Expert-level knowledge of Ceph architecture, including RADOS, RGW, and RBD.
Strong proficiency in Linux/Unix administration and scripting (e.g., Python, Bash).
Experience with infrastructure-as-code and automation tools (e.g., Ansible, Terraform, Puppet).

Nice-to-Have Qualifications

Experience with containerization and orchestration technologies (Docker, Kubernetes).
Familiarity with networking concepts (BGP, LACP) in the context of distributed systems.
Contributions to open-source projects, particularly Ceph.