Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

📅 2026-01-27
🤖 AI Summary
This work addresses the significant accuracy degradation in large language models (LLMs) and vision-language models (VLMs) caused by NVFP4 quantization. To mitigate this issue, the authors propose Quantization-Aware Distillation (QAD), a method that leverages a full-precision teacher model to distill knowledge into a quantized student model via KL divergence loss, effectively recovering performance. Notably, QAD operates without requiring the full training dataset and is applicable across multiple post-training stages—including supervised fine-tuning, reinforcement learning, and model merging—thereby substantially reducing the engineering complexity and instability associated with conventional quantization-aware training. Experiments on the Nemotron series and Llama Nemotron models demonstrate that QAD achieves NVFP4 inference accuracy nearly matching that of BF16, underscoring its generality and effectiveness.
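The summary's core mechanism is compact enough to sketch: the only supervision is the full-precision teacher, and the loss is a KL divergence between teacher and student token distributions, with gradients flowing through the fake-quantized student. Below is a minimal PyTorch sketch of one training step, assuming a HuggingFace-style `.logits` interface; the function name and single-loss setup are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def qad_step(teacher, student, input_ids, optimizer):
    """One quantization-aware distillation (QAD) step, sketched.

    `teacher` is the frozen full-precision model; `student` shares its
    architecture but runs with simulated (fake-quantized) NVFP4 weights,
    so the update lands on the student's latent full-precision params.
    """
    with torch.no_grad():                      # teacher only supplies targets
        t_logits = teacher(input_ids).logits
    s_logits = student(input_ids).logits

    # Forward KL(teacher || student) over the next-token distributions.
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),       # student log-probs (input)
        F.log_softmax(t_logits, dim=-1),       # teacher log-probs (target)
        log_target=True,
        reduction="batchmean",
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher, rather than a labeled dataset, defines the target distribution, any reasonably in-domain text can serve as input, which is consistent with the summary's claim that QAD does not require the full training dataset.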

📝 Abstract
This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: (1) it shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; (2) it is robust to data quality and coverage, enabling accuracy recovery without the full training data. We evaluate QAD across multiple post-trained models, including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
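Since the abstract centers on NVFP4, a brief note on the format helps: NVFP4 stores tensors as 4-bit E2M1 values in 16-element blocks, with each block sharing an FP8 (E4M3) scale under a per-tensor FP32 scale. The sketch below simulates only the rounding behavior, keeping block scales in full precision instead of modeling the two-level FP8+FP32 scaling; it is background illustration, not code from the report.

```python
import torch

# Non-negative magnitudes representable in E2M1, the 4-bit element format.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_nvfp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Quantize-dequantize a tensor to approximate NVFP4 numerics.

    Simplification: block scales stay in full precision here; real NVFP4
    encodes them as FP8 (E4M3) beneath a per-tensor FP32 scale.
    """
    assert x.numel() % block == 0, "pad the tensor to a multiple of `block`"
    xb = x.reshape(-1, block)                          # (num_blocks, 16)

    # Scale each block so its largest magnitude maps to E2M1's max, 6.0.
    amax = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = amax / 6.0

    # Round scaled magnitudes to the nearest representable E2M1 value.
    grid = E2M1_GRID.to(device=x.device, dtype=x.dtype)
    scaled = xb / scale
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * scaled.sign()

    return (q * scale).reshape(x.shape)                # dequantized, same shape
```

In QAD training this op would typically be wrapped in a straight-through estimator so gradients pass through the rounding to the student's full-precision weights.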
Problem

Research questions and friction points this paper is trying to address.

quantization, accuracy recovery, large language models, vision-language models, NVFP4
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-Aware Distillation, NVFP4, LLM quantization, post-training quantization, knowledge distillation
Authors

Meng Xin (NVIDIA)
Sweta Priyadarshi (NVIDIA)
Jingyu Xin (NVIDIA)
Bilal Kartal (NVIDIA)
Aditya Vavre (University of Texas at Austin)
Asma Kuriparambil Thekkumpate (NVIDIA)
Zijia Chen (NVIDIA)
Ameya Sunil Mahabaleshwarkar (NVIDIA)
Ido Shahaf (NVIDIA)
Akhiad Bercovich (Weizmann Institute of Science)
Kinjal Patel (NVIDIA)
Suguna Varshini Velury (NVIDIA)
Chenjie Luo (NVIDIA)
Zhiyu Cheng (NVIDIA)
Jenny Chen (NVIDIA)
Chen-Han Yu (NVIDIA)
Wei Ping (NVIDIA)
Oleg Rybakov (NVIDIA)
Nima Tajbakhsh (NVIDIA)
Oluwatobi Olabiyi (NVIDIA)
Dusan Stosic (NVIDIA)
Di Wu (NVIDIA)
Song Han (NVIDIA)
Eric Chung (NVIDIA)
Sharath Turuvekere Sreenivas (NVIDIA)
Bryan Catanzaro (NVIDIA)
Yoshi Suhara (NVIDIA)
Tijmen Blankevoort (Meta)