About the job
At Annapurna Labs, we are at the forefront of hardware/software co-design, not just in Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Fleet Operations Team is looking for candidates interested in diving deep into our fleet of ML servers deployed around the world.
Responsibilities
Work as a member of a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products
Use data to root-cause hardware failures and identify live trends on the most complex systems in AWS
Implement and improve system-level testing across the product lifecycle
Develop software that can be maintained, improved, documented, tested, and reused
Dive deep on issues at the intersection of hardware and software
Qualifications
Minimum
2+ years of non-internship professional software development experience
1+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
1+ years of administrative experience in networking, storage systems, and operating systems, plus hands-on systems engineering experience
Knowledge of systems engineering fundamentals (networking, storage, operating systems)
Experience programming with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, Ruby
Experience with Linux/Unix
Experience with debugging and systems analysis to identify and quickly resolve or mitigate issues
Bachelor's degree in Computer Science, Computer Engineering, or Electrical Engineering
Preferred
Experience in hardware design and validation of components, subsystems and systems
Experience with SoC bring-up and post-silicon validation
Master's degree in Computer Science, Computer Engineering, or Electrical Engineering