About the job
This role is for a Senior Machine Learning Engineer in the Distributed Training team for AWS Neuron, responsible for development, enablement and performance tuning of a wide variety of ML model families, including massive-scale Large Language Models (LLMs) such as GPT-OSS, Qwen and Llama, as well as Stable Diffusion, Vision Transformers (ViT) and many more.
Responsibilities
You will lead efforts to build distributed training support into PyTorch, the Neuron compiler, and runtime stacks. You will enable distributed training strategies and use them to optimize models for peak performance and maximum efficiency on AWS custom silicon, including Trainium servers.
Qualifications
Minimum
Experience training these large models using PyTorch is a must, as is familiarity with distributed training strategies such as FSDP (Fully Sharded Data Parallel), pipeline parallelism (PP), and context parallelism. Strong software development skills, the ability to deep dive, the ability to work effectively within cross-functional teams, and a solid foundation in Machine Learning are critical for success in this role.
Preferred
Distributed training libraries such as torchtitan, torchtune, HF RL, and DeepSeek are central to this work, and extending them to Neuron-based systems is key. Experience with post-training strategies such as DPO/PPO/HF torch-tune is an additional strength and aligns with the team's success.