Senior AI Hardware Systems Engineer, Annapurna Labs, Trainium Machine Learning Fleet Operations

Amazon
Austin, TX, USA | 2026-03-20 | ONSITE

About the job

At Annapurna Labs, we are at the forefront of hardware/software co-design, not just in Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Fleet Operations Team is looking for candidates interested in diving deep into our fleet of ML servers deployed around the world.

Responsibilities

Work as a member of a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products

Use data to root-cause hardware failures and identify live trends on the most complex systems in AWS

Implement and improve system level testing across the product lifecycle

Develop software that can be maintained, improved upon, documented, tested, and reused

Dive deep on issues at the intersection of hardware and software

Qualifications

Minimum

4+ years of programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby

3+ years of non-internship professional software development experience, or a Bachelor's degree or above in engineering or equivalent

3+ years of experience designing or architecting (design patterns, reliability, and scaling) new and existing systems

Experience in computer architecture, or experience with general troubleshooting/debugging of hardware

Experience working with device technologies under development; familiarity with flashing firmware, basic device debugging, and reading/pulling device logs

BS degree in computer science, computer engineering, or related field, or 4+ years of technical work experience

2+ years of server hardware troubleshooting and repair experience

Preferred

7+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems

Experience with concepts such as system architecture, optimization, system dynamics, system analysis, statistical analysis, reliability analysis, and decision making

Strong analytical skills, attention to detail, and effective communication abilities, or experience troubleshooting and debugging technical systems and experience with automation and version control tools

Knowledge of operating systems, hardware, storage, network, security, database administration and cloud infrastructure

Master's degree or above in electrical engineering, computer engineering, or equivalent

Experience with SoC bring-up and post-silicon validation