Senior Deep Learning Inference Performance Architect

NVIDIA
US, NC, Durham / US, CA, Santa Clara | 2026-01-09 | onsite

About the job

NVIDIA is seeking a Senior Performance Architect: a creative engineer who loves to squeeze every cycle of performance out of deep learning software. The Inference Architecture team does groundbreaking hardware-software co-design work focused on accelerating AI Inference workloads. In this role, you will write performance-optimized low-level code on today’s GPUs, evaluate and improve state-of-the-art performance techniques in production Large Language Model deployments, and help guide our future GPU architecture decisions. If you enjoy digging deep into GPU architecture details, are passionate about AI, and know where every cycle goes when you write highly tuned software, this role may be a great fit for you.

Responsibilities

Develop innovative GPU and system architectures to extend the state of the art in AI Inference performance and efficiency

Model, analyze and prototype key deep learning algorithms and applications

Understand and analyze the interplay of hardware and software architectures on future algorithms and applications

Write efficient software for AI Inference, including CUDA kernels, framework-level code, and application-level code

Collaborate across the company to guide the direction of AI, working with software, research and product teams

Qualifications

Minimum

An MS or PhD in a relevant discipline (CS, EE, Math) or equivalent experience, with 5+ years of relevant experience

Strong mathematical foundation in machine learning and deep learning

Expert programming skills in C, C++, and Python

Familiarity with GPU computing (CUDA or similar) and HPC (MPI, OpenMP)

Strong knowledge and coursework in computer architecture

Preferred

Background with systems-level performance modeling, profiling, and analysis

Experience in characterizing and modeling system-level performance, executing comparison studies, and documenting and publishing results

Experience in optimizing AI Inference workloads with CUDA kernel development