Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

About the job

The Compute Infrastructure - Orchestration & Scheduling team uses Kubernetes and Serverless technologies to build a large, reliable, and efficient compute infrastructure. This infrastructure powers hundreds of large-scale clusters globally, running millions of online containers and offline jobs daily, including AI and LLM workloads. The team builds industry-leading infrastructure that empowers AI innovation, delivering the performance, scalability, and reliability that the most demanding AI/LLM workloads require. The team is also dedicated to open-sourcing key infrastructure technologies, including projects in the K8s portfolio such as kubewharf, Serverless initiatives like Ray on K8s, and the LLM inference control plane project AiBrix.

Responsibilities

- Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.

- Innovate on core scheduling capabilities: Design and maintain a truly unified scheduler that powers diverse workloads (containers and VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.

- Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers.

- Lead infrastructure for next-gen ML workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.

- Deliver quality and innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Qualifications

Minimum

- B.S./M.S. degree in Computer Science, Computer Engineering, or a related area with 2+ years of relevant industry experience; new graduates with a Ph.D. degree and a strong publication record may be considered as an exception.

- Solid understanding of at least one of the following fields: Unix/Linux environments, distributed and parallel systems, high-performance networking systems, or large-scale software system development.

- Proven experience designing, architecting, and building cloud and ML infrastructure, including but not limited to resource management, allocation, job scheduling, and monitoring.

- Familiarity with container and orchestration technologies such as Docker and Kubernetes.

- Proficiency in at least one major programming language such as Python, Go, C++, Rust, or Java.

Preferred

- Experience with at least one large-scale cluster management system, e.g., Kubernetes, Ray, YARN, or Mesos.

- Experience in large-scale resource efficiency management and job scheduling development.

- Project experience in application scaling, workload co-location, and isolation enhancement.

- Experience with a public cloud provider (AWS, Azure, or GCP) and its ML services (e.g., AWS SageMaker, Azure ML, GCP Vertex AI).

- Great communication skills and the ability to work well within a team and across engineering teams.

- Passion for system efficiency, quality, performance, and scalability.