Senior GPU Software Performance Engineer — Post‑Training

About the job

Drive the performance of post-training workloads on AMD Instinct™ GPUs. You’ll work across kernels, distributed training, and framework integrations to deliver fast, stable, and reproducible training pipelines on ROCm.

Responsibilities

Lead performance work for fine-tuning and RL training solutions on AMD GPUs.

Improve throughput, memory efficiency, and stability across data, model, and optimizer steps.

Optimize multi-GPU/multi-node training and communication patterns.

Contribute efficient kernels/ops and targeted graph-level optimizations.

Profile, diagnose, and resolve bottlenecks using standard tooling; prevent regressions in CI.

Ship reproducible pipelines and documentation adopted by internal teams and external developers.

Collaborate with framework, compiler, and model teams to land durable improvements.

Qualifications

Minimum

No minimum qualifications listed.

Preferred

Proven GPU performance engineering experience for deep learning (ROCm/HIP, Triton, or similar).

Hands-on experience with SFT, LoRA, and RL-based training at scale.

Strong PyTorch experience (torch.distributed, FSDP/ZeRO or equivalent).

Proficient in Python and C++; comfortable reading/writing kernels when needed.

Experience with distributed systems and collective communication libraries.

Track record of turning profiles into fixes, upstreaming changes, and documenting results.