Tech Lead, AML Orchestration

About the job

We are seeking a Tech Lead, AML Orchestration to own and advance ByteDance’s distributed orchestration platforms. This leader will oversee a team of Machine Learning Engineers specializing in orchestration and scheduling, guiding the technical strategy for resource efficiency, distributed training, and online inference systems. The role requires deep expertise in large-scale distributed systems, orchestration frameworks, and cross-team collaboration.

Responsibilities

- Lead, mentor, and grow a team of orchestration-focused ML engineers; set technical vision and ensure engineering excellence.

- Design and optimize distributed orchestration and scheduling strategies across large-scale Kubernetes/Godel environments, ensuring efficiency, reliability, and scalability.

- Drive initiatives for autoscaling, resource multiplexing, and preemption across heterogeneous workloads and clusters, including multi-datacenter and multi-cloud setups.

- Partner with framework, platform, and research teams to build next-generation distributed training and serving systems for ultra-large, high-dimensional recommendation models.

- Architect robust and elastic online orchestration frameworks for large-scale inference, supporting evolving recommendation and ads models.

- Stay ahead of trends in orchestration, scheduling, and distributed computing, incorporating best practices and emerging technologies.

Qualifications

Minimum

- Bachelor’s degree or higher in Computer Science, Engineering, or a related field.

- 5+ years of experience in large-scale distributed systems, with at least 5 years in a technical leadership role.

- Proficiency in one or more modern programming languages (Golang, Python, C++, or similar).

- Deep understanding of orchestration frameworks (e.g., Kubernetes, YARN) and distributed systems design principles.

- Proven experience optimizing system performance, resource utilization, and scheduling strategies.

- Strong analytical thinking, problem-solving, and communication skills.

Preferred

- Experience with orchestration or ML frameworks such as Ray, TFX, VeRL, vLLM, or equivalent.

- Familiarity with distributed computing systems (Spark, Flink) and ML pipelines.

- Contributions to open-source scheduling or ML infrastructure projects.

- Hands-on experience with multi-tenant environments and cloud-native architectures.

- Experience collaborating with and leading global, cross-functional teams across different time zones.