SoC Quality and Reliability Engineer, Google Cloud

Google
Sunnyvale, CA, USA

About the job

In this role, you’ll work to shape the future of AI/ML hardware acceleration. You will have an opportunity to drive cutting-edge TPU (Tensor Processing Unit) technology that powers Google's most demanding AI/ML applications. You’ll be part of a team that pushes boundaries, developing custom silicon solutions that power the future of Google's TPU. You'll contribute to the innovation behind products loved by millions worldwide, and leverage your design and verification expertise to verify complex digital designs, with a specific focus on TPU architecture and its integration within AI/ML-driven systems.

As a Quality and Reliability Engineer for Google Cloud, you will lead the development of design-for-reliability guidelines and drive the adoption of advanced technologies to optimize silicon production and reliability. You will be responsible for ensuring that high-performance computing (HPC) SoC products meet stringent quality requirements by collaborating across design, manufacturing, and hardware teams to execute comprehensive test plans. Additionally, you will own the cross-functional investigation and root-cause analysis of integrated circuit (IC) issues to develop effective solutions in a production environment.

Responsibilities

Own the development of design-for-reliability guidelines, collaborating with subject matter experts (e.g., DFBI, SER, EMIR, PERC, HVDRC, Margining).

Define and execute silicon and package qualification activities (HTOL, ELFR, ESD/LU, b/HAST, THB, etc.).

Extract, manipulate, and analyze large volumes of data from silicon and package qualification programs, high-volume manufacturing, and field returns to identify failure mechanisms, reliability trends, and opportunities for yield and QnR improvement.

Own cross-functional investigation of IC quality and reliability issues to identify root causes and develop solutions (e.g., return merchandise authorization (RMA) triage, analytics, failure analysis).

Develop and implement physics-based statistical quality and reliability models (ELF, TDDB, NBTI, HCI, time-zero failures, etc.) to predict package and silicon failure mechanisms, degradation patterns, and lifetime behaviors.

Qualifications

Minimum

Bachelor's degree in Electrical Engineering, Mechanical Engineering, Computer Engineering or Computer Science, with an emphasis on computer architecture.

Experience with semiconductor manufacturing, chip design, or device physics.

Experience in silicon or hardware quality, such as reliability testing and product lifecycle standards.

Experience working with foundry or advanced packaging processes.

Preferred

Master's degree or PhD in Electrical Engineering, Mechanical Engineering, Computer Engineering or Computer Science, with an emphasis on computer architecture.

Experience in semiconductor reliability and manufacturing processes (fab, assembly, test), or IC and packaging failure mechanisms and related failure analysis.

Experience in data analytics, especially to identify commonalities and abnormalities.

Experience with chiplets and high-power devices.

Familiarity with test methods and hardware for silicon qualification (e.g., High-Temperature Operating Life (HTOL) chambers, Electrostatic Discharge (ESD), Latch-Up (LU), etc.).