Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training

About the job

This role is for a Senior Machine Learning Engineer on the Distributed Training team for AWS Neuron, responsible for the development, enablement, and performance tuning of a wide variety of ML model families, including massive-scale Large Language Models (LLMs) such as GPT-OSS, Qwen, and Llama, as well as Stable Diffusion, Vision Transformers (ViT), and many more.

Responsibilities

You will lead efforts to build distributed training support into PyTorch, the Neuron compiler, and the runtime stack. You will enable distributed training strategies and use them to optimize models for peak performance and maximum efficiency on AWS custom silicon, including Trainium servers.

Qualifications

Minimum

Experience training these large models using PyTorch is a must, as is familiarity with distributed training strategies such as FSDP (Fully Sharded Data Parallel), PP (pipeline parallelism), and context parallelism. Strong software development skills, the ability to deep dive, the ability to work effectively within cross-functional teams, and a solid foundation in Machine Learning are critical for success in this role.

Preferred

Distributed training libraries such as torchtitan, torchtune, HF RL, and DeepSeek are central to this work, and extending them to Neuron-based systems is key. Experience with post-training strategies such as DPO, PPO, and HF torch-tune is an additional strength and aligns with the team's success.