🤖 AI Summary
This work addresses the significant accuracy degradation that NVFP4 quantization causes in large language models (LLMs) and vision-language models (VLMs). To mitigate this, the authors propose Quantization-Aware Distillation (QAD), which uses a full-precision teacher model to distill knowledge into a quantized student model via a KL divergence loss, effectively recovering lost accuracy. Notably, QAD does not require the full training dataset and is applicable across multiple post-training stages, including supervised fine-tuning, reinforcement learning, and model merging, thereby substantially reducing the engineering complexity and instability associated with conventional quantization-aware training. Experiments on Nemotron-series and Llama Nemotron models show that QAD brings NVFP4 inference accuracy nearly to parity with BF16, underscoring its generality and effectiveness.
📝 Abstract
This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: (1) it is remarkably effective and stable for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; (2) it is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models, including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
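The core objective described above, a KL divergence between the full-precision teacher's output distribution and the quantized student's, can be sketched as follows. This is a minimal illustration in plain Python, not the report's implementation; the `temperature` parameter and function names are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with an optional temperature
    # (temperature is an illustrative knob, not stated in the abstract).
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # Forward KL(p || q): p is the teacher distribution, q the student's.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def qad_loss(teacher_logits, student_logits, temperature=1.0):
    # Hypothetical per-token QAD loss: match the quantized student's
    # predictive distribution to the full-precision teacher's.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)
```

In practice this loss would be averaged over tokens and minimized with respect to the student's (quantization-aware) weights; the loss is zero exactly when the two distributions agree, and requires only teacher outputs rather than ground-truth labels, which is consistent with the report's observation that QAD does not need the full training data.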