Principal AI Performance Engineer

AMD
San Jose, CA, USA
2026-03-11

About the job

AMD is looking for a performance-obsessed engineer to drive AI inference performance to the absolute limit on AMD GPUs. You will lead a small, highly technical team and work end-to-end across the stack: profiling, diagnosing, and optimizing leading models on customer-relevant serving configurations (e.g., agentic coding, long-context, high-throughput serving). You will move from challenge to challenge, tackling the hardest performance problems across our most strategic customer engagements and leaving behind measurable uplifts and reusable methodology. This is not a sustaining role: every engagement is different, and every optimization leaves a lasting impact.

Responsibilities

Drive performance optimization end-to-end across the stack on leading models and customer-relevant serving configurations, closing competitive gaps through kernel and systems-level optimizations

Profile, diagnose, and resolve the hardest cross-stack performance bottlenecks, from GPU kernels and operator dispatch to framework-level scheduling and multi-node communication

Diagnose kernel-level performance issues using profiling tools: identify occupancy limitations, L2 cache thrashing, register pressure, memory coalescing issues, and similar problems, then translate findings into actionable optimizations

Lead customer-facing technical engagements: present findings, recommend optimizations, and deliver measurable performance uplifts

Integrate and optimize custom kernels (Triton, Gluon, CK, PyDSL, ASM, AITER) within serving frameworks, understanding dispatch paths, shape extraction, and backend selection

Optimize multi-node distributed inference: communication-compute overlap, parallelism strategies, and scale-out performance

Develop and refine shared performance optimization methodology that raises the bar across the broader team

Leverage AI agents to accelerate daily work and define best practices for AI-assisted performance engineering

Upstream optimizations into open-source frameworks such as vLLM, SGLang, and PyTorch

Qualifications

Minimum

No minimum qualifications listed.

Preferred

7+ years of software development experience in GPU computing, AI systems, or high-performance computing

Deep hands-on experience with AI serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) and their internals

Strong background in end-to-end workload profiling and bottleneck diagnosis: you can trace from user request to GPU kernel and back

Understanding of GPU kernel performance characteristics: occupancy, register and LDS pressure, memory coalescing, cache utilization, wavefront scheduling, and instruction-level bottlenecks

Ability to read and reason about kernel-level profiling data and translate it into concrete optimization actions. You may not write kernels from scratch daily, but you can tell exactly why one is slow and what needs to change

Understanding of model architectures (transformers, MoE, diffusion), inference paradigms (speculative decoding, prefill-decode disaggregation, continuous batching), and how they map to hardware

Experience with custom kernel development or integration (HIP, CUDA, Triton, CK, or similar)

Understanding of multi-GPU and multi-node distributed systems: scale-up and scale-out topologies, RCCL/NCCL, RDMA, and communication-compute overlap

System and rack-level design awareness: understanding performance tradeoffs across the full deployment stack

Strong proficiency in Python and C++

Customer-facing technical leadership experience: ability to engage with customers, present findings, and drive decisions

Fluent in AI-assisted development: daily user of AI agents and tools, with a mindset toward defining new AI-powered workflows

Strong Linux systems knowledge

Excellent written and verbal English communication skills