About the job
The Compute Infrastructure - Orchestration & Scheduling team uses Kubernetes and Serverless technologies to build a large, reliable, and efficient compute infrastructure. This infrastructure powers hundreds of large-scale clusters globally, running millions of online containers and offline jobs daily, including AI and LLM workloads. The team is dedicated to building industry-leading infrastructure that empowers AI innovation, delivering the performance, scalability, and reliability needed to support the most demanding AI/LLM workloads.
Responsibilities
- Design and evolve the architecture of large-scale Kubernetes-based infrastructure platforms to ensure performance, scalability, and resilience for diverse workloads, including microservices, big data, and AI/LLM applications.
- Improve K8s system performance across the control and data planes, including optimizing pod lifecycle, resource orchestration, and system-level throughput under high load.
- Build robust observability and performance analysis frameworks, define K8s system-level SLOs, and lead data-driven tuning and optimization initiatives in production.
- Develop intelligent, unified resource management and scheduling systems (at both the node and cluster levels) to support a wide range of compute resources in large-scale, cloud-native environments.
- Drive the standardization and optimization of container runtime environments to enhance workload isolation, reliability, and resource efficiency across heterogeneous compute environments.
Qualifications
Minimum
- B.S./M.S. degree in Computer Science, Computer Engineering, or a related area with 3+ years of relevant industry experience; candidates with a Ph.D. degree and a strong publication record may be considered as an exception.
- Solid understanding of at least one of the following fields: Unix/Linux environments, distributed and parallel systems, high-performance networking systems, or large-scale software system development.
- Familiarity with container and orchestration technologies such as Docker and Kubernetes.
- Proficiency in at least one major programming language such as Python, Go, C++, Rust, or Java.
Preferred
- Knowledge of big data or machine learning workflows in a Kubernetes environment.
- Experience in developing or contributing to cloud-native open-source projects.
- Hands-on project experience with containerized applications through internships, coursework, or personal projects.
- Familiarity with observability tools and frameworks like Prometheus, Grafana, or distributed tracing systems.