Software Engineer III, Infrastructure, Cloud AI

About the job

Join our team to improve the Accelerated Linear Algebra (XLA) compiler stack used for a wide range of machine learning models on TPU, GPU, and CPU hardware. You will work on projects to enhance compiler stability and usability across different frameworks and hardware, from research to production serving. Compiler expertise isn't required, which makes this a good project for onboarding onto ML infrastructure if you have an interest in compilers and ML runtime systems.

Our team focuses on productionizing the integration of the XLA compiler and ML frameworks, which is critical for running machine learning models efficiently on Google's accelerator hardware (TPUs) as well as GPUs and CPUs. We work to standardize compiler interfaces and integration, improve stability, and ensure model consistency between development and production. We collaborate with ML framework, compiler, runtime, and other infrastructure teams. Our efforts support most ML teams within Google and power Google Cloud's ML offerings.

Responsibilities

Write and test product or system development code.

Understand how accelerator compilers and runtimes interact at a high level.

Develop and apply metrics to understand the problem you are solving and gauge status/success as needed.

Close infrastructure (infra) gaps to help with ML stack maturation (e.g., reduce the number of ways something is done, improve reproducibility, improve tooling, improve usability).

Participate in design reviews with peers and stakeholders to decide among available technologies.

Qualifications

Minimum

Bachelor’s degree or equivalent practical experience.

2 years of experience with software development in C++.

2 years of experience with developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies, storage or hardware architecture.

2 years of experience testing, maintaining, or launching software products, and 1 year of experience with software design and architecture.

Preferred

Experience with machine learning model training and serving.

Experience with C++ development.

Experience working across or understanding different parts of the software stack (e.g., ML frameworks, compilers, ML runtimes, or systems).

Interest in compiler technology, ML runtime systems, or low-level software optimization.