About the job
This role is for a software engineer on the Distributed Training team for AWS Neuron. The engineer is responsible for development, enablement, and performance tuning of a wide variety of ML model families, including massive-scale multi-modal large language models like Llama, Qwen, gpt-oss, DeepSeek and beyond, as well as multi-modal generation models such as Stable Diffusion, Flux, WAN, and many more.
Responsibilities
This role will help lead efforts to optimize distributed training performance on Trainium, with a primary focus on maximizing training throughput, model FLOPs utilization (MFU), and efficiency across the Neuron software stack. You will work across PyTorch, JAX, and the Neuron compiler and runtime to enable and tune large-scale training workloads on the latest Trainium instances.
Qualifications
Minimum
3+ years of non-internship professional software development experience; 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems; experience programming with at least one programming language
Preferred
3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations; Bachelor's degree in computer science or equivalent