About the job
The Waymo ML Infrastructure team works with Research and Production teams to develop models in Perception and Planning that are core to our autonomous driving software. We help our partners by offering the best solutions for the entire model development lifecycle. These solutions are developed in close collaboration with teams at Google. They are geared towards both scaling models and solving problems unique to ML for autonomous driving. In this role, you will improve the runtime efficiency of input data pipelines for large-scale training workloads. This is a unique opportunity to work on ML systems and improve our model training processes.
Responsibilities
Design and improve distributed input data pipelines for large-scale ML training workloads.
Collaborate with researchers and ML engineers to resolve bottlenecks in data pipeline performance.
Improve the runtime goodput of ML training workloads by optimizing input data processing systems and ensuring scalability and reliability across distributed environments.
Implement and maintain advanced ML infrastructure tools, including ML Pathways, Grain, JAX, and TensorFlow.
Evaluate and integrate modern technologies to enhance the performance and scalability of ML systems.
Promote best practices for distributed systems architecture and contribute to technical leadership within the team.
Qualifications
Minimum
B.S. in Computer Science or Math, or 5+ years of equivalent real-world experience.
Proficient in distributed systems design with an understanding of ML data pipeline optimization.
Experience with ML frameworks, including TensorFlow and JAX.
Hands-on experience with libraries like Grain or tf.data service.
Solid programming skills in Python and C++.
Practical familiarity with profiling tools to uncover performance bottlenecks.
Preferred
M.S. in Computer Science or Math.
Familiarity with distributed dataflow frameworks like ML Pathways.