About the job
We are now looking for a Senior Software Engineer for Quantized Inference! NVIDIA is seeking software engineers to accelerate the discovery and deployment of efficient inference recipes for LLMs. A recipe defines which operators are transformed into low-precision or sparsified variants — unlocking throughput and latency gains without regressing accuracy or verbosity. Recipes may incorporate techniques such as rotations, block scaling to attenuate outlier impact, or improved calibration data drawn from SFT/RL pipelines.
Responsibilities
Implement quantized and sparse recipes in inference engines (vLLM, TRT-LLM, SGLang)
Own model export pipelines (ModelOpt, Megatron-LM to HuggingFace), ensuring quantized checkpoints serialize correctly for downstream serving
Build prototypes and benchmarking harnesses to evaluate recipe throughput/interactivity before full optimization
Develop data analysis tooling and visualizations for numerics debugging
Improve developer productivity across the team: CI, build systems, training infrastructure, and reducing pipeline friction
Participate in code reviews and incorporate feedback
Qualifications
Minimum
Proficient in Python; familiarity with C++
Strong software engineering fundamentals: concise, well-tested code; fluent with AI-assisted tooling
Experience with ML accelerators, including a basic understanding of how different ML layers affect execution time
Familiarity with PyTorch internals (custom ops, autograd, export) or equivalent framework
Experience reading, modifying, or contributing to a large open-source codebase
MS/PhD in Computer Science or related field, or equivalent experience
4+ years in a relevant software engineering role
Demonstrated ability to move fast under ambiguous requirements; strong written and verbal communication
Preferred
Experience contributing to inference serving frameworks (vLLM, TRT-LLM, SGLang) or Triton kernel development
Track record of debugging numerical issues across mixed-precision boundaries
Deep experience with model compression techniques: PTQ, QAT, structured/unstructured sparsity