AI Hardware Systems Engineer, Annapurna Labs, Trainium Machine Learning Fleet Operations

About the job

At Annapurna Labs, we are at the forefront of hardware/software co-design, not just in Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Fleet Operations Team is looking for candidates interested in diving deep into our fleet of ML servers deployed around the world.

Responsibilities

Work as a member of a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products

Use data to root-cause hardware failures and identify live trends on the most complex systems in AWS

Implement and improve system-level testing across the product lifecycle

Develop software that can be maintained, improved, documented, tested, and reused

Dive deep on issues at the intersection of hardware and software

Qualifications

Minimum

2+ years of non-internship professional software development experience

1+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)

1+ years of administrative experience in networking, storage systems, and operating systems, plus hands-on systems engineering experience

Knowledge of systems engineering fundamentals (networking, storage, operating systems)

Experience programming with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, Ruby

Experience with Linux/Unix

Experience with debugging and systems analysis to identify and quickly resolve or mitigate issues

Bachelor's degree in Computer Science, Computer Engineering, or Electrical Engineering

Preferred

Experience in hardware design and validation of components, subsystems, and systems

Experience with SoC bring-up and post-silicon validation

Master's degree in Computer Science, Computer Engineering, or Electrical Engineering