About the job
The AWS Neuron Collectives team is seeking a Software Engineer to optimize collective operations for AWS Trainium. Trainium is one of Amazon's highest-priority initiatives, powering the frontier AI models being trained today. Collectives are the critical operations that scale AI compute across the data center. You'll work in depth to optimize compute for the specific topologies used to train modern LLMs. Working closely with the hardware team, you'll push for maximum performance using C/C++, interfacing with DMA and firmware, and investigating detailed topologies. You'll analyze current collective algorithms using publicly accessible tools like Neuron Explorer and optimize them to fully utilize compute and bus bandwidth at data center scale. This is a unique opportunity to impact how AI training runs at AWS scale while growing your technical breadth and depth.
Responsibilities
Enhance collective algorithms and topologies for optimal training performance
Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
Monitor and analyze processor, DMA, firmware, and workload metrics
Optimize collective operations to scale AI compute across the data center
Work closely with the hardware team to co-optimize software and Trainium silicon
Develop and optimize C/C++ implementations of collective communication patterns
Investigate and implement improvements for specific training topologies used by modern LLMs
Build and maintain analysis frameworks and automation solutions
Qualifications
Minimum
Experience building complex software systems that have been successfully delivered to customers
Experience contributing to the architecture and design (design patterns, reliability, and scaling) of new and current systems
Bachelor's degree in computer science or equivalent
Knowledge of engineering practices and patterns for the full software, hardware, and network development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and live-site operations
Development experience within the last 3 years, or embedded development experience in C/C++
Preferred
Master's degree in computer science or equivalent
Experience with hardware/software integration and real-time systems
Familiarity with collective communication algorithms (e.g., all-reduce, all-gather) or distributed training frameworks