Product Manager, AI Platform Kernels and Communication Libraries

About the job

NVIDIA's AI Software Platforms team seeks a technical product manager to accelerate next-generation inference deployments through innovative libraries, communication runtimes, and kernel optimization frameworks. This role bridges low-level GPU programming with ecosystem-wide developer enablement for products including CUTLASS, cuDNN, NCCL, and NVSHMEM, as well as open-source contributions to Triton and FlashInfer.

Responsibilities

Architect developer-focused products that simplify high-performance inference and training deployment across diverse GPU architectures.

Define the multi-year strategy for kernel and communication libraries by analyzing performance bottlenecks in emerging AI workloads.

Collaborate with CUDA kernel engineers to design intuitive, high-level abstractions for memory and distributed execution.

Partner with open-source communities like Triton and FlashInfer to shape and drive ecosystem-wide roadmaps.

Qualifications

Minimum

7+ years of technical PM experience shipping developer products for GPU acceleration, with expertise in HPC optimization stacks.

Expert-level understanding of CUDA execution models and multi-GPU protocols, with a proven track record of translating hardware capabilities into software roadmaps.

BS or MS in Computer Engineering, or equivalent experience, or demonstrated expertise in parallel computing architectures.

Strong technical communication and interpersonal skills, with experience explaining complex optimizations to developers and researchers.

Preferred

PhD or equivalent experience in Computer Engineering or a related technical field.

Contributed to performance-critical open-source projects like Triton, FlashAttention, or TVM with measurable adoption impact.

Crafted GitHub-first developer tools with >1k stars or similar community engagement metrics.

Published research on GPU kernel optimization, collective communication algorithms, or ML model serving architectures.

Experience building cost-per-inference models incorporating hardware utilization, energy efficiency, and cluster scaling factors.