LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

📅 2026-05-09
🤖 AI Summary
Existing weight quantization methods lose significant accuracy on long-decoding reasoning benchmarks, which the authors attribute to degraded KV-cache fidelity and a mismatch between calibration data and the deployment distribution. This work proposes LAQuant, a layer-wise weight-only QAT method that adds no online-transform overhead. It combines reasoning-domain calibration with a one-layer lookahead loss, whose implicit cross-layer co-adaptation preserves the residual stream of the next layer. On Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11 percentage points (1.93 over ParoQuant++ at matched calibration) and achieves a 3.42× decoding speedup over FP16 on an RTX A6000, compared with ParoQuant's 3.01×.
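For readers unfamiliar with the W3G128 notation, the sketch below shows plain group-wise fake quantization, reading W3G128 by the usual convention as 3-bit weights with one scale and offset per group of 128 input channels. It is a round-to-nearest baseline for illustration only, not LAQuant itself; the function name `quantize_groupwise` and the asymmetric min/max scheme are assumptions.

```python
import numpy as np

def quantize_groupwise(w, bits=3, group_size=128):
    """Round-to-nearest asymmetric quantization per group (hypothetical helper).

    w: (out_features, in_features) weight matrix.
    Returns the dequantized weights and the integer codes.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be a multiple of group_size"
    groups = w.reshape(out_features, in_features // group_size, group_size)

    # One (scale, offset) pair per group of `group_size` weights.
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2**bits - 1  # 7 for 3-bit, i.e. codes 0..7
    scale = np.maximum(w_max - w_min, 1e-8) / levels

    codes = np.clip(np.round((groups - w_min) / scale), 0, levels)
    w_hat = codes * scale + w_min  # dequantize back to floating point
    return w_hat.reshape(w.shape), codes.astype(np.uint8).reshape(w.shape)

# Toy usage: a 4 x 256 matrix gives two groups of 128 per output row.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256))
w_hat, q = quantize_groupwise(w, bits=3, group_size=128)
```

Each weight lands within half a quantization step of its original value; a QAT method such as the one summarized above would then train the weights or quantization parameters rather than stop at rounding.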
📝 Abstract
Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining reasoning-domain calibration with a one-layer lookahead loss whose implicit cross-layer co-adaptation preserves the next-layer residual stream. For Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11pp (1.93pp over ParoQuant++ at matched calibration) while achieving a 3.42x decoding speedup over FP16 on RTX A6000, compared with ParoQuant's 3.01x.
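The one-layer lookahead idea can be illustrated with a toy contrast between a purely local layer-wise loss and a loss measured one layer further down the residual stream. This is a minimal sketch under stated assumptions, not the paper's formulation: the toy residual `block`, the choice to keep the next layer in full precision, the mean-squared-error objective, and all function names are hypothetical.

```python
import numpy as np

def block(x, w):
    """Toy residual block x + tanh(x W^T), standing in for a transformer layer."""
    return x + np.tanh(x @ w.T)

def local_loss(x, w_fp, w_q):
    """Conventional layer-wise objective: match this layer's own output."""
    return float(np.mean((block(x, w_q) - block(x, w_fp)) ** 2))

def lookahead_loss(x, w_fp, w_q, w_next):
    """Assumed lookahead objective: match the residual stream after the NEXT layer.

    The next layer (w_next) is kept in full precision here, which is an assumption.
    """
    reference = block(block(x, w_fp), w_next)
    candidate = block(block(x, w_q), w_next)
    return float(np.mean((candidate - reference) ** 2))

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((8, d))             # stand-in for calibration activations
w_fp = rng.standard_normal((d, d)) * 0.1    # full-precision weights of layer l
w_next = rng.standard_normal((d, d)) * 0.1  # weights of layer l+1
w_q = np.round(w_fp * 4) / 4                # crude stand-in for quantized weights

loss_local = local_loss(x, w_fp, w_q)
loss_ahead = lookahead_loss(x, w_fp, w_q, w_next)
```

In a training loop, the quantization parameters of the current layer would be updated to reduce `loss_ahead`, so errors are weighted by how the following layer actually consumes them rather than by the current layer's output alone.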
Problem

Research questions and friction points this paper is trying to address.

Large Reasoning Models
Weight Quantization
Long Autoregressive Decoding
Accuracy Degradation
KV-cache Fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LookAhead Quantization
Layer-wise Quantization
KV-cache Fidelity
Hessian-subspace Alignment
Weight-only QAT