We are hiring a Platform Engineer to help build and evolve the software platform behind large-scale AI infrastructure.
This is a hands-on engineering role for someone who can write strong Python, work deeply with Kubernetes, design and build platform applications, and operate close to bare-metal infrastructure.
You will help build the systems that make GPU compute easier to provision, operate, secure and scale across AI infrastructure environments.
This is not a generic DevOps role. We are not looking for someone who has only maintained pipelines, written Terraform or managed cloud services. We need someone who can build real platform software and who understands the infrastructure it runs on.
Responsibilities:
- Design and build platform applications, APIs and services
- Write production-grade Python for infrastructure and platform use cases
- Work with Kubernetes to build scalable platform capabilities
- Design and build Kubernetes operators and controllers across compute, storage and networking
- Build tooling that improves how bare-metal and GPU infrastructure is provisioned, operated and monitored
- Translate operational pain points into scalable platform features
- Improve platform reliability, observability and performance
- Work across Linux, networking, storage and distributed systems
- Collaborate with product, security, infrastructure, networking and compute teams
- Help build the platform layer for AI infrastructure designed to operate at industrial scale
Requirements:
- Strong Python engineering experience
- Strong hands-on Kubernetes experience
- Experience designing and building applications, APIs, services or internal platform tooling
- Bare-metal infrastructure experience
- Strong Linux systems experience
- Good understanding of networking, storage and distributed systems
- Experience building production-grade systems with proper testing, CI/CD, code reviews and clean engineering standards
- A practical engineering mindset and the ability to solve real infrastructure problems through software
Nice to have:
- Experience building Kubernetes operators, CRDs or controllers
- Exposure to GPU infrastructure or high-performance computing (HPC)
- Experience with Go or Rust
- Knowledge of confidential computing, including TEEs, AMD SEV, Intel TDX or Confidential Containers (CoCo)
- Experience with Ceph or distributed storage systems
- Familiarity with Prometheus, Grafana or OpenTelemetry
- Experience with BGP, RDMA or high-performance networking
- Exposure to NVIDIA GPU infrastructure or bare-metal cloud environments
AI infrastructure is constrained by the ability to deliver reliable compute at scale. This role sits in the platform layer that connects software engineering with real infrastructure.
You will help build systems that run close to the metal, across Kubernetes, Linux, networking, storage and GPU compute.
This is a role for someone who wants to build the infrastructure layer behind AI, not just operate tools around it.