Staff SWE, Compiler Architect, System Performance Modeling

About the job

Google Cloud’s mission is to make every business successful through AI by combining cutting-edge technology, infrastructure, and talent. AI/ML software engineers in Cloud bridge the gap between pioneering models and a massive product vehicle reaching billions. Our talent density and AI-powered tools drive rapid development, rooted in a culture of empowerment and a bias to action. In this role, you aren’t just building technology; you’re shaping the frontier of enterprise and driving the evolution of advanced models.

Our team is pioneering next-generation performance modeling and simulation technologies that drive multi-year system architecture roadmaps for cutting-edge machine learning accelerators. We are looking for a visionary technical lead to define and own the accuracy and fidelity of our critical co-design simulation platform. You will work on the most complex system-level performance challenges in close collaboration with hardware designers, ML researchers, and product architects, defining the next decade of AI systems at data center scale. If you are excited about building the most powerful ML systems through HW-SW co-design and optimization, please join us.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide. We're the driving force behind Google's groundbreaking innovations, empowering the development of our cutting-edge AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

Responsibilities

Establish and maintain high-confidence correlation infrastructure between simulated performance and physical hardware measurements (silicon).

Architect and evolve the simulation layer to support deep exploration of complex, business-critical workloads (e.g., large language models, advanced kernels) and future system topologies.

Identify and solve system-level hardware/software bottlenecks and optimization opportunities at the critical pre-silicon stage.

Provide high-confidence lower-bound performance estimates for future ML systems and architectures.

Qualifications

Minimum

Bachelor's degree or equivalent practical experience.

8 years of experience programming in C++ or Python.

5 years of experience testing and launching software products.

5 years of experience with performance, large-scale systems data analysis, visualization tools, or debugging.

3 years of experience with software design and architecture.

Preferred

Experience with hardware/software co-design problems, especially performance analysis and bottleneck identification at the pre-silicon stage.

Experience with ML system architectures, including knowledge of compilers, Intermediate Representations (IRs), and hardware accelerators.

Experience enabling and optimizing large-scale ML models (e.g., LLMs, large embedding models).

Ability to lead technical strategy for complex systems, influencing both simulation toolchains and hardware roadmaps.

Proven expertise in constructing custom IR dialects and leveraging open-source compiler frameworks (MLIR, XLA) to perform system-level analysis and explore software-hardware mapping opportunities.

Expertise in architecting high-confidence, high-velocity system performance modeling and correlation infrastructure.