🤖 AI Summary
This work addresses the significant accuracy degradation that NVFP4 quantization causes in large language models (LLMs) and vision-language models (VLMs). To mitigate this, the authors propose Quantization-Aware Distillation (QAD), which uses a full-precision teacher model to distill knowledge into a quantized student model via a KL divergence loss, effectively recovering lost accuracy. Notably, QAD does not require the full training dataset and is applicable across multiple post-training stages, including supervised fine-tuning, reinforcement learning, and model merging, thereby substantially reducing the engineering complexity and instability associated with conventional quantization-aware training. Experiments on Nemotron-series and Llama Nemotron models show that QAD brings NVFP4 inference accuracy nearly to parity with BF16, underscoring its generality and effectiveness.
📝 Abstract
This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: (1) it is remarkably effective and stable for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; (2) it is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models, including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
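The core objective described above, a KL divergence between the full-precision teacher's output distribution and the quantized student's, can be sketched as follows. This is a minimal illustration in plain Python, not the report's implementation; the `temperature` parameter and function names are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with an optional temperature
    # (temperature is an illustrative knob, not stated in the abstract).
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # Forward KL(p || q): p is the teacher distribution, q the student's.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def qad_loss(teacher_logits, student_logits, temperature=1.0):
    # Hypothetical per-token QAD loss: match the quantized student's
    # predictive distribution to the full-precision teacher's.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)
```

In practice this loss would be averaged over tokens and minimized with respect to the student's (quantization-aware) weights; the loss is zero exactly when the two distributions agree, and requires only teacher outputs rather than ground-truth labels, which is consistent with the report's observation that QAD does not need the full training data.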