Role Overview
We are seeking a highly experienced Senior Infrastructure Engineer to join our platform engineering team. This role focuses on the design, implementation, and maintenance of our large-scale, distributed storage infrastructure, with a primary emphasis on Ceph. You will be a key player in ensuring the reliability, performance, and scalability of our storage solutions, which are critical to all our services. The ideal candidate has a deep understanding of Ceph architecture and a proven track record of managing petabyte-scale clusters in a production environment.
Key Responsibilities
- Design, deploy, and manage highly available, scalable, and performant Ceph storage clusters across multiple data centers.
- Develop and maintain automation for cluster provisioning, monitoring, and lifecycle management using tools like Ansible, Puppet, or SaltStack.
- Act as the subject matter expert for all storage-related issues, providing advanced troubleshooting, performance tuning, and root cause analysis.
- Collaborate with software engineering and SRE teams to define storage requirements, establish best practices, and integrate storage solutions into our CI/CD pipelines.
- Lead capacity planning, and plan and execute disaster recovery strategies and major version upgrades for the Ceph ecosystem.
Required Skills & Qualifications
- 6+ years of hands-on experience managing large-scale Ceph clusters (1 PB+) in a 24/7 production environment.
- Expert-level knowledge of Ceph architecture, including RADOS, RGW, and RBD.
- Strong proficiency in Linux/Unix administration and scripting (e.g., Python, Bash).
- Experience with infrastructure-as-code and automation tools (e.g., Ansible, Terraform, Puppet).
Nice-to-Have Qualifications
- Experience with containerization and orchestration technologies (Docker, Kubernetes).
- Familiarity with networking concepts (BGP, LACP) in the context of distributed systems.
- Contributions to open-source projects, particularly Ceph.