Senior Software Engineer - AI Triton Communication

About the job

Triton is a widely adopted language and compiler for high-performance GPU kernels, powering major AI frameworks such as PyTorch, vLLM, and SGLang. As AI workloads increasingly scale across multiple GPUs and nodes, first-class support for distributed execution and communication in Triton is strategically critical to enabling efficient large-scale training and inference on AMD Instinct accelerators. In this role, you will advance the Triton compiler and runtime stack for AMD CDNA and next-generation GPU architectures by building native distributed execution and communication capabilities. You will develop compiler and runtime infrastructure that enables efficient inter-GPU communication, scalable execution, and optimal hardware utilization. You will work across compiler, runtime, and hardware layers, collaborating closely with GPU architecture and software teams to help establish AMD GPUs as a best-in-class platform for Triton-based distributed AI.

Responsibilities

Design and develop native distributed communication and execution capabilities within the Triton AMDGPU backend, enabling scalable multi-GPU execution for large-scale AI workloads

Design and implement Triton compiler and runtime mechanisms for native GPU-initiated communication, including collective operations, remote memory access, synchronization, and distributed execution primitives

Drive performance optimization across compute and communication, including inter-GPU data movement, communication/computation overlap, memory hierarchy utilization, and GPU-driven scheduling efficiency

Develop and optimize distributed Triton kernels and execution models to achieve high performance, scalability, and efficient hardware utilization for AI workloads

Analyze, profile, and debug complex cross-stack issues spanning the Triton compiler, runtime, ROCm stack, and GPU hardware execution

Collaborate closely with GPU architecture, compiler, runtime, and performance teams to co-design and enable next-generation distributed GPU programming and execution capabilities

Contribute to the open-source Triton and ROCm distributed ecosystems, driving innovation in distributed GPU computing

Qualifications

Minimum

No minimum qualifications listed.

Preferred

5+ years of experience in compiler development, GPU software, distributed systems, or performance engineering

Familiarity or hands-on experience with the Triton compiler and runtime

Deep understanding of modern GPU architectures, including execution model, memory hierarchy (LDS, L2, HBM), scheduling, occupancy, and hardware performance characteristics

Good understanding of GPU runtime systems, communication stacks, and multi-GPU interconnects such as XGMI, NVLink, PCIe, or InfiniBand, and of their performance implications

Familiarity with distributed GPU communication libraries such as RCCL, NCCL, NVSHMEM, rocSHMEM, or MPI

Experience developing, optimizing, and scaling workloads across multiple GPUs, including inter-GPU communication, synchronization, and communication/computation overlap

Strong experience with GPU programming using Triton, HIP, CUDA, or similar parallel programming environments

Strong knowledge of MLIR and/or LLVM internals

Experience profiling, debugging, and optimizing performance across compiler, runtime, and hardware layers

Familiarity with ROCm, HIP, CUDA, or similar GPU programming ecosystems, including performance profiling and optimization tools

Experience optimizing large-scale AI, machine learning, or HPC workloads across multi-GPU systems

Experience contributing to open-source projects and working in collaborative, cross-functional engineering environments

Strong problem-solving, communication, and technical leadership skills