System Debug Engineer Manager, Cloud AI Infrastructure

Google
Kirkland, WA, USA / Austin, TX, USA

About the job

As part of the Google Cloud Support team, you will ensure customers maximize their investment. As a Systems Debug Engineer, you will be a trusted advisor driving hardware understanding and issue resolution. You will troubleshoot platform challenges, providing expert solutions that enable innovation. You will represent the customer and collaborate with engineering and product teams to drive continuous improvement across global cloud products and services.

Responsibilities

Drive technical team performance across on-call activities and system management by delivering leadership, mentorship, and career development while collaborating with primary responders to address system issues.

Debug platform hardware, silicon, and AI/ML workloads to drive root-cause resolution, develop permanent infrastructure improvements, and build tools for faster diagnosis through troubleshooting and reproduction.

Collaborate cross-functionally with Product, Quality, and Engineering teams to enhance product outcomes, and engage with Site Reliability Engineering (SRE) teams to ensure high-quality production and reliability.

Resolve customer challenges on AI/ML infrastructure through effective diagnosis, resolution, and the implementation of investigation tools to increase productivity for critical reported issues.

Serve daily as a consultant and subject matter expert for internal stakeholders, resolving deployment and operational obstacles across AI infrastructure environments.

Qualifications

Minimum

Bachelor's degree in Computer Science or IT-related field, or equivalent practical experience.

8 years of experience with system design.

5 years of experience managing or leading a team.

5 years of experience managing technical work, engineering strategy, and roadmaps.

5 years of experience with hardware debug (silicon debug, platform debug, IO interface, memory analysis).

3 years of experience with organizational design.

Preferred

5 years of experience working with vendors or customers.

3 years of experience with leadership development and career growth of employees.

3 years of experience in analyzing and troubleshooting distributed systems.

2 years of CPU, dGPU, or TPU debug or validation experience.

Understanding of memory and high-speed IO technologies.