Staff GPU Performance Engineer – AI Frameworks

AMD
San Jose, California, United States
2026-03-20

About the job

AMD is looking for a world-class AI frameworks engineer who can provide technical leadership in the development of AI frameworks across the AMD ecosystem. You will play a pivotal role in developing and optimizing deep learning frameworks for AMD GPUs. You will engage with both internal GPU library teams and open-source maintainers to ensure seamless integration of optimizations, using cutting-edge compiler technologies and sound engineering principles to drive continuous improvement.

Responsibilities

Optimize Deep Learning Frameworks: Enhance and optimize frameworks such as PyTorch, vLLM, and SGLang for AMD GPUs in open-source repositories.

Develop GPU Kernels: Create and optimize GPU kernels to maximize performance for specific AI operations.

Develop & Optimize Models: Design and optimize deep learning models, applying quantization tuned specifically for AMD GPU performance.

Collaborate with GPU Library Teams: Work closely with internal teams to analyze and improve training and inference performance on AMD GPUs.

Collaborate with Open-Source Maintainers: Engage with framework maintainers to ensure code changes are aligned with requirements and integrated upstream.

Software Engineering Best Practices: Apply sound engineering principles to ensure robust, maintainable solutions.

Qualifications

Minimum

No minimum qualifications listed.

Preferred

GPU Kernel Development & Optimization: Experienced in designing and optimizing GPU kernels for deep learning on AMD GPUs using HIP, CUDA, and assembly (ASM). Strong knowledge of AMD architectures (GCN, RDNA) and low-level programming to maximize performance for AI operations, leveraging tools such as Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance.

Experience with AI software frameworks such as PyTorch, vLLM, and SGLang, including benchmarking and profiling.

Experience using profiling and benchmark tooling for large models.

Experience with model optimization, such as low-precision quantization (MXFP4, FP8, INT4) and sparsity.

Solid understanding of model architectures such as LLMs, MoE, and diffusion models.

Proficient in C++ programming.

Experience developing and debugging in Python.

Team player who is ready to work with a geographically distributed team.