About the job
This role is for a software engineer on the Distributed Training team for AWS Neuron. The engineer is responsible for the development, enablement and performance tuning of a wide variety of ML model families, including massive-scale multi-modal large language models like Llama, Qwen, gpt-oss, DeepSeek and beyond, as well as multi-modal generation models such as Stable Diffusion, Flux, WAN, and many more.
Responsibilities
You will lead efforts to optimize distributed training performance on Trainium, with a primary focus on maximizing training throughput, model FLOPs utilization (MFU), and efficiency across the Neuron software stack. You will work across PyTorch, JAX, and the Neuron compiler and runtime to enable and tune large-scale training workloads on the latest Trainium instances.
Qualifications
Minimum
5+ years of non-internship professional software development experience
5+ years of programming experience with at least one software programming language
5+ years of experience leading design or architecture (design patterns, reliability and scaling) of new and existing systems
5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Experience as a mentor or tech lead, or leading an engineering team
Preferred
Bachelor's degree in computer science or equivalent
Knowledge of machine learning frameworks and end-to-end model training