About the job
We are seeking a Tech Lead, AML Orchestration, to own and advance ByteDance’s distributed orchestration platforms. This leader will oversee a team of Machine Learning Engineers specializing in orchestration and scheduling, guiding the technical strategy for resource efficiency, distributed training, and online inference systems. The role requires deep expertise in large-scale distributed systems, orchestration frameworks, and cross-team collaboration.
Responsibilities
- Lead, mentor, and grow a team of orchestration-focused ML engineers; set technical vision and ensure engineering excellence.
- Design and optimize distributed orchestration and scheduling strategies across large-scale Kubernetes/Godel environments, ensuring efficiency, reliability, and scalability.
- Drive initiatives for autoscaling, resource multiplexing, and preemption across heterogeneous workloads and clusters, including multi-datacenter and multi-cloud setups.
- Partner with framework, platform, and research teams to build next-generation distributed training and serving systems for ultra-large, high-dimensional recommendation models.
- Architect robust and elastic online orchestration frameworks for large-scale inference, supporting evolving recommendation and ads models.
- Stay ahead of trends in orchestration, scheduling, and distributed computing, incorporating best practices and emerging technologies.
Qualifications
Minimum
- Bachelor’s degree or higher in Computer Science, Engineering, or a related field.
- 5+ years of experience in large-scale distributed systems, with at least 5 years in a technical leadership role.
- Proficiency in one or more modern programming languages (Golang, Python, C++, or similar).
- Deep understanding of orchestration frameworks (e.g., Kubernetes, YARN) and distributed systems design principles.
- Proven experience optimizing system performance, resource utilization, and scheduling strategies.
- Strong analytical thinking, problem-solving, and communication skills.
Preferred
- Experience with orchestration or ML frameworks such as Ray, TFX, VeRL, vLLM, or equivalent.
- Familiarity with distributed computing systems (Spark, Flink) and ML pipelines.
- Contributions to open-source scheduling or ML infrastructure projects.
- Hands-on experience with multi-tenant environments and cloud-native architectures.
- Experience collaborating with and leading global, cross-functional teams across different time zones.