About the job
Are you passionate about Kubernetes and AI, and do you want to help build the best platform for ML/AI infrastructure? Do you thrive when your work directly empowers teams to push the boundaries of what's possible? We're the Platform API team within NVIDIA's DGX Cloud organization: a collaborative group of cloud platform engineers, architects, and SREs who build and nurture the declarative, Kubernetes-native control plane that powers GPU-accelerated infrastructure across multiple cloud providers. Together, we're empowering the world's leading AI teams to train and deploy at datacenter scale.
Responsibilities
Develop software systems to support large-scale deployments of cloud infrastructure
Design and develop APIs to support Infrastructure as Code (IaC) automation and deployment workflows
Contribute to multiple source code projects to meet NVIDIA's requirements for software services
Collaborate with engineering managers, architects, designers, and frontend engineers to deliver high-quality software
Automate the validation of software solutions with unit and integration tests
Participate in the ownership and health of CI/CD pipelines from dev to production environments
Collaborate with other specialists for feedback on proposed designs and product direction
Openly share successes and failures in a no-blame environment
Qualifications
Minimum
BS in Computer Science, Information Systems, Computer Engineering or equivalent experience
5+ years of proven experience in large-scale software development
Experience building and shipping services on Kubernetes
Background in using and contributing to open-source projects
Experience collaborating with teams to write software that supports cloud services at scale
Programming experience in a relevant language, e.g. Golang, Python
Ability to communicate design and quality strategy in written, visual, and oral formats
Experience with a wide range of modern infrastructure tools and technologies
Preferred
Experience with Kubernetes Cluster API, Terraform, Tinkerbell, and other infrastructure tooling
Practical experience with Azure, GCP, or AWS
Ability to refactor software to run on platforms such as Kubernetes
Ability to discuss and work with CSI, CNI, and CRI and/or familiarity with the CNCF and the tooling across the ecosystem
Upstream contributions to open-source projects