Senior Software Engineer, Training Efficiency

About the job

Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. Since its start as the Google Self-Driving Car Project in 2009, Waymo has focused on building the Waymo Driver (The World's Most Experienced Driver™) to improve access to mobility while saving thousands of lives now lost to traffic crashes. The Waymo Driver powers Waymo’s fully autonomous ride-hail service and can also be applied to a range of vehicle platforms and product use cases. The Waymo Driver has provided over ten million rider-only trips, enabled by its experience autonomously driving over 100 million miles on public roads across 15+ U.S. states and tens of billions of miles in simulation.

The Waymo ML Infrastructure team works with Research and Production teams to develop models in Perception and Planning that are core to our autonomous driving software. We help our partners by offering the best solutions for the entire model development lifecycle. These solutions are developed in close collaboration with teams at Google and are geared towards both scaling models and solving problems unique to ML for autonomous driving.

You will improve the runtime efficiency of input data pipelines for large-scale training workloads. This is a unique opportunity to work on ML systems and improve our model training processes.

Responsibilities

Design and improve distributed input data pipelines for large-scale ML training workloads.

Collaborate with researchers and ML engineers to resolve bottlenecks in data pipeline performance.

Improve the runtime goodput of ML training workloads by optimizing input data processing systems and ensuring scalability and reliability across distributed environments.

Implement and maintain advanced ML infrastructure tools, including ML Pathways, Grain, JAX, and TensorFlow.

Evaluate and integrate modern technologies to enhance the performance and scalability of ML systems.

Promote best practices for distributed systems architecture and contribute to technical leadership within the team.

Qualifications

Minimum

B.S. in Computer Science or Math, or 5+ years of equivalent real-world experience.

Proficiency in distributed systems design and an understanding of ML data pipeline optimization.

Experience with ML frameworks, including TensorFlow and JAX.

Hands-on experience with libraries like Grain or tf.data service.

Solid programming skills in Python and C++.

Practical familiarity with profiling tools to uncover performance bottlenecks.

Preferred

M.S. in Computer Science or Math.

Familiarity with distributed dataflow frameworks like ML Pathways.