Senior Software Engineer, Quantized Inference

About the job

We are now looking for a Senior Software Engineer for Quantized Inference! NVIDIA is seeking software engineers to accelerate the discovery and deployment of efficient inference recipes for LLMs. A recipe defines which operators are transformed into low-precision or sparsified variants, unlocking throughput and latency gains without regressing accuracy or verbosity. Recipes may incorporate techniques such as rotations, block scaling to attenuate outlier impact, or improved calibration data drawn from SFT/RL pipelines.

Responsibilities

Implement quantized and sparse recipes in inference engines (vLLM, TRT-LLM, SGLang)

Own model export pipelines (ModelOpt, Megatron-LM to HuggingFace), ensuring quantized checkpoints serialize correctly for downstream serving

Build prototypes and benchmarking harnesses to evaluate recipe throughput/interactivity before full optimization

Develop data analysis tooling and visualizations for numerics debugging

Improve developer productivity across the team: CI, build systems, training infrastructure, pipeline friction

Participate in code reviews and incorporate feedback

Qualifications

Minimum

Proficient in Python; familiarity with C++

Strong software engineering fundamentals: concise, well-tested code; fluent with AI-assisted tooling

Experience with ML accelerators, including a basic understanding of how certain ML layers affect execution time

Familiarity with PyTorch internals (custom ops, autograd, export) or an equivalent framework

Experience reading, modifying, or contributing to a large open-source codebase

MS/PhD in Computer Science or related field, or equivalent experience

4+ years in a relevant software engineering role

Demonstrated ability to move fast with ambiguous requirements, with strong written and verbal communication

Preferred

Experience contributing to inference serving frameworks (vLLM, TRT-LLM, SGLang) or Triton kernel development

Track record of debugging numerical issues across mixed-precision boundaries

Deep experience with model compression techniques: PTQ, QAT, structured/unstructured sparsity