Software Engineer II - AI/ML, AWS Neuron

About the job

The Annapurna Labs team at Amazon Web Services (AWS) builds AWS Neuron, the software development kit used to accelerate deep learning and GenAI workloads on Amazon’s custom machine learning accelerators, Inferentia and Trainium. This comprehensive toolkit includes an ML compiler, runtime, and application framework that integrate seamlessly with popular ML frameworks like PyTorch and JAX, enabling unparalleled ML inference and training performance.

Responsibilities

Design, develop, and optimize machine learning models and frameworks for deployment on custom ML hardware accelerators.

Participate in all stages of the ML system development lifecycle, including distributed-computing-based architecture design, implementation, performance profiling, hardware-specific optimizations, testing, and production deployment.

Build infrastructure to systematically analyze and onboard multiple models with diverse architectures.

Analyze and optimize system-level performance across multiple generations of Neuron hardware.

Conduct detailed performance analysis using profiling tools to identify and resolve bottlenecks.

Conduct comprehensive testing, including unit and end-to-end model testing, with continuous deployment and releases through pipelines.

Work directly with customers to enable and optimize their ML models on AWS accelerators.

Collaborate across teams to develop innovative optimization techniques.

Qualifications

Minimum

- 3+ years of non-internship professional software development experience, or Bachelor's degree or above in engineering or equivalent

- 3+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems

- Experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution

- Knowledge of system performance, memory management, and parallel computing principles

- Experience debugging, profiling, and implementing software engineering best practices in large-scale systems

Preferred

- Strong software development skills using Python and C++; system-level programming and ML knowledge are both critical to this role