About the job
At Annapurna Labs, we are at the forefront of hardware/software co-design, not just in Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Fleet Operations Team is looking for candidates interested in diving deep into our fleet of ML servers deployed around the world.
Responsibilities
Work as a member of a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products
Use data to root-cause hardware failures and identify live trends on the most complex systems in AWS
Implement and improve system-level testing across the product lifecycle
Develop software which can be maintained, improved upon, documented, tested, and reused
Dive deep on issues at the intersection of hardware and software
Qualifications
Minimum
4+ years of programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
3+ years of non-internship professional software development experience, or Bachelor's degree or above in engineering or equivalent
3+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
Experience in computer architecture, or experience with general troubleshooting/debugging of hardware
Experience working with device technologies under development; familiarity with flashing firmware, basic device debugging, and reading/pulling device logs
BS degree in computer science, computer engineering, or related field, or 4+ years of technical work experience
2+ years of server hardware troubleshooting and repair experience
Preferred
7+ years of experience leading the design or architecture of new and existing systems (design patterns, reliability, and scaling)
Experience with concepts such as system architecture, optimization, system dynamics, system analysis, statistical analysis, reliability analysis, and decision making
Strong analytical skills, attention to detail, and effective communication abilities, or experience troubleshooting and debugging technical systems together with experience in automation and version control tools
Knowledge of operating systems, hardware, storage, network, security, database administration and cloud infrastructure
Master's degree or above in electrical engineering, computer engineering, or equivalent
Experience with SoC bring-up and post-silicon validation