Tech Lead Manager, ML Optimization

About the job

Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. The Waymo ML Infrastructure team accelerates Waymo’s mission by building the best ecosystem for sustainably innovating and shipping ML-powered intelligence. We are looking for an experienced senior Tech Lead Manager (TLM) to join our team. In this critical role, you will lead the development and enable the efficient deployment of large-scale machine learning models using state-of-the-art AI infrastructure. You will work cross-functionally at the intersection of data engineering, model development, and low-latency deployment in the datacenter and on-device, ensuring seamless integration across teams and technologies to power efficient innovation.

Responsibilities

Take ownership of improving model efficiency on different platforms and drive a model-system co-design practice that meets both technical and business requirements.
Proactively study SOTA model architectures and optimizations from the community and Google, including world models, diffusion, and flow-matching techniques, and translate them into measurable technical deliverables in Waymo’s onboard driving stack.
Drive developer tooling innovation for a model performance inspector in highly distributed training/inference setups, apply roofline analysis to understand the efficiency headroom, and lead working groups to deliver the optimizations and meet the system requirements.
Develop high-performance optimizations and tools for various models and large-scale training/inference, including on next-generation TPUs and low-bit-precision training/inference setups, and ensure all system components align toward high performance and goodput goals.
Guide efforts across multiple teams and organizations to ensure seamless integration of data generation, model development, and deployment pipelines.
Act as a mentor to junior engineers, helping them grow their technical expertise and fostering a culture of collaboration and engineering excellence.
Manage the performance of the individual contributors (ICs) on a medium-sized team of ~10 engineers.

Qualifications

Minimum

10+ years of professional software engineering experience, with at least 5 years in machine learning infrastructure, such as developing, training, deploying, and optimizing large-scale machine learning systems.
Experience using ML accelerator profiling tools to uncover performance bottlenecks.
Solid experience in the development and optimization of machine learning infrastructure tools such as DeepSpeed, PyTorch, TensorFlow, JAX, or similar frameworks.
Deep understanding of state-of-the-art machine learning models and architectures, such as autoregressive and diffusion transformers, and familiarity with custom kernels for compute efficiency across diverse hardware.
Strong leadership skills, with experience navigating cross-functional teams and providing technical leadership on projects across multiple organizations.
Excellent communication skills, both verbal and written, with the ability to translate complex technical concepts for a broad audience.

Preferred

A Master’s or PhD in Computer Science, Engineering, or a related field.