About the job
The Compute Infrastructure - Orchestration & Scheduling team uses Kubernetes and Serverless technologies to build a large, reliable, and efficient compute infrastructure. This infrastructure powers hundreds of large-scale clusters globally, running millions of online containers and offline jobs daily, including AI and LLM workloads. The team is dedicated to building cutting-edge, industry-leading infrastructure that empowers AI innovation, ensuring high performance, scalability, and reliability to support the most demanding AI/LLM workloads.
Responsibilities
- Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
- Innovate on core scheduling capabilities: Design and maintain a truly unified scheduling system that powers diverse workloads (containers and VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
- Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers.
- Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
- Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.
Qualifications
Minimum
- B.S./M.S. degree in Computer Science, Computer Engineering, or a related area with 2+ years of relevant industry experience; new graduates with a Ph.D. degree and a strong publication record may be considered as an exception.
- Solid understanding of at least one of the following fields: Unix/Linux environments, distributed and parallel systems, high-performance networking systems, or large-scale software system development.
- Proven experience designing, architecting, and building cloud and ML infrastructure, including but not limited to resource management, allocation, job scheduling, and monitoring.
- Familiarity with container and orchestration technologies such as Docker and Kubernetes.
- Proficiency in at least one major programming language such as Python, Go, C++, Rust, or Java.
Preferred
- Experience with at least one large-scale cluster management system, e.g., Kubernetes, Ray, YARN, or Mesos.
- Experience in large-scale resource efficiency management and job scheduling development.
- Project experience in application scaling, workload co-location, and isolation enhancement.
- Experience with a public cloud provider (AWS, Azure, or GCP) and its ML services (e.g., AWS SageMaker, Azure ML, GCP Vertex AI).
- Great communication skills and the ability to work well within a team and across engineering teams.
- Passion for system efficiency, quality, performance, and scalability.