About the job
As a Neuron Collectives Software Developer, you will work on enhancing collective algorithms and topologies for optimal training performance, using tools like Neuron Explorer to identify bottlenecks, and optimizing collective operations to scale AI compute across the data center. You will also develop and optimize C/C++ implementations of collective communication patterns, and build and maintain analysis frameworks and automation solutions.
Responsibilities
Enhance collective algorithms and topologies for optimal training performance
Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
Monitor and analyze processor, DMA, firmware, and workload metrics
Optimize collective operations to scale AI compute across the data center
Work closely with the hardware team to co-optimize software and Trainium silicon
Develop and optimize C/C++ implementations of collective communication patterns
Investigate and implement improvements for specific training topologies used by modern LLMs
Build and maintain analysis frameworks and automation solutions
Qualifications
Minimum
3+ years of non-internship professional software development experience
2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience
Experience programming with at least one software programming language
Preferred
3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Bachelor's degree in computer science or equivalent