FAAR: Format-Aware Adaptive Rounding for NVFP4

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the significant performance degradation of existing quantization methods when applied to non-uniform low-bit formats such as NVFP4, which stems from their neglect of the format's intrinsic numerical grid structure and leads to large rounding errors. To overcome this limitation, the authors propose Format-Aware Adaptive Rounding (FAAR), a novel strategy that, for the first time, explicitly incorporates NVFP4's non-uniform quantization grid into the rounding optimization process. This is complemented by a lightweight two-stage Format Alignment (2FA) fine-tuning mechanism, enabling efficient approximation of the theoretically optimal quantization. Evaluated on Llama3-1B and Qwen3-1.7B, the approach achieves WikiText-2 perplexities of 12.60 and 21.27, respectively, and consistently outperforms prior methods across zero-shot tasks, with only approximately 4 additional GPU-hours of training overhead.

πŸ“ Abstract
Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low precision formats such as NVFP4 offer a promising solution for reducing memory footprint and accelerating computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored to the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions guided by loss gradients, our method effectively approximates the theoretically optimal quantization. To complement FAAR, we introduce a two-stage Format Alignment (2FA) fine-tuning scheme that aligns LLM parameters layer-by-layer with the NVFP4 numerical space, further narrowing the performance gap. Remarkably, this learnable optimization incurs a minimal training overhead of only 4 GPU-hours on Llama3-1B. Extensive experiments demonstrate the effectiveness of our approach. Compared with Round-to-Nearest (RTN), our method reduces perplexity on WikiText-2 from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B. Additionally, our method consistently outperforms state-of-the-art approaches across various zero-shot downstream tasks.
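To make the abstract's point about grid non-uniformity concrete, the following is an illustrative NumPy sketch of the Round-to-Nearest (RTN) baseline the paper compares against, not the paper's FAAR implementation. The FP4 E2M1 code points are unevenly spaced (step 0.5 near zero, but 2.0 between 4 and 6), so nearest-neighbor rounding incurs its largest errors at the coarse end of the grid; that is exactly the region where a learned, format-aware rounding decision can help. The function name and the simplified per-block float scaling are my own assumptions; real NVFP4 stores per-16-element FP8 (E4M3) block scales, which this sketch omits.

```python
import numpy as np

# The representable FP4 E2M1 magnitudes used by NVFP4 (with +/-0 collapsed
# to a single zero, this gives 15 distinct values). Note the non-uniform
# spacing: 0.5 between small codes, but 2.0 between 4 and 6.
E2M1_GRID = np.array(
    [-6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6],
    dtype=np.float32,
)

def rtn_nvfp4(x, block=16):
    """Round-to-nearest baseline: per-block absmax scaling onto the E2M1 grid.

    Assumes len(x) is a multiple of `block`. Simplification: the block scale
    is kept in full precision instead of NVFP4's FP8 (E4M3) scale encoding.
    """
    x = np.asarray(x, dtype=np.float32)
    xb = x.reshape(-1, block)
    # Map each block's absolute maximum onto the grid maximum (6.0).
    scale = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)  # all-zero blocks stay zero
    # Nearest grid point per element (first match wins on ties).
    idx = np.abs(xb[:, :, None] / scale[:, :, None] - E2M1_GRID).argmin(axis=-1)
    return (E2M1_GRID[idx] * scale).reshape(x.shape)
```

For example, with a block whose absmax is 6 (so the scale is 1), the value 5 sits exactly between the grid points 4 and 6 and suffers an absolute rounding error of 1.0, while 0.3 lands within 0.2 of the grid point 0.5; this asymmetry in worst-case error across the grid is the rounding suboptimality that a format-aware, gradient-guided rounding rule is designed to reduce.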
Problem

Research questions and friction points this paper is trying to address.

quantization
NVFP4
rounding error
non-uniform grid
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Format-Aware Adaptive Rounding
NVFP4
non-uniform quantization
learnable rounding
low-bit LLM deployment
Hanglin Li (Li Auto Inc.)
Shuchang Tian (Li Auto Inc.)
Chen Lin (Li Auto Inc.)
Zhiyong Zhao (Li Auto Inc.)
Kun Zhan (AI Researcher, LiAuto; Autonomous Driving, Computer Vision, 3D Vision)