Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Position interpolation (PI) in RoPE coupled with post-training quantization (PTQ) induces position-dependent logit noise—arising from long-context aliasing, dynamic range expansion, axial anisotropy, and outlier shifts—degrading long-context performance. Method: We systematically characterize this coupling effect and propose a fine-tuning-free, architecture-agnostic stabilization framework. We introduce two diagnostic metrics—Interpolation Pressure Ratio and Tail Inflation Ratio—and design a RoPE-aware band-wise rescaling mechanism: grouping dimensions by RoPE frequency and independently optimizing per-group Q/K weight scaling factors under symmetric scaling to preserve logit magnitude. The method requires only a small long-context development set and integrates seamlessly into existing inference pipelines. Results: Experiments show up to 0.7% accuracy recovery on standard benchmarks, >10% perplexity reduction on GovReport, and zero degradation on short-context tasks—demonstrating robustness and practicality.
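The band-wise rescaling mechanism described above can be sketched in a few lines. The sketch below assumes an interleaved RoPE pair layout (dims 2i, 2i+1 share a frequency) and a contiguous split of frequencies into bands; the number of bands and the split heuristic are illustrative choices, not the paper's exact configuration. Because both dimensions of a RoPE pair are scaled by the same factor, the scaling commutes with the rotation, so scaling W_Q rows by s and W_K rows by 1/s per band leaves full-precision logits unchanged while reshaping the per-band quantization ranges.

```python
import numpy as np

def rope_bands(head_dim, n_bands=4, base=10000.0):
    # RoPE pair i rotates at frequency base**(-2i/head_dim); group the
    # head_dim//2 pairs into contiguous frequency bands (illustrative split).
    freqs = base ** (-np.arange(head_dim // 2) * 2.0 / head_dim)
    return np.array_split(np.arange(head_dim // 2), n_bands), freqs

def apply_band_scales(Wq, Wk, bands, scales):
    # Symmetric variant: scale W_Q rows by s and W_K rows by 1/s per band.
    # Each RoPE pair (2i, 2i+1) is scaled uniformly, so the scaling commutes
    # with the per-pair rotation and q.k logits are preserved in full precision.
    Wq, Wk = Wq.copy(), Wk.copy()
    for band, s in zip(bands, scales):
        rows = np.concatenate([2 * band, 2 * band + 1])
        Wq[rows] *= s
        Wk[rows] /= s
    return Wq, Wk
```

The rescaled weights drop into an existing PTQ pipeline unchanged; only the quantizer sees different per-channel ranges.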

📝 Abstract
Extending LLM context windows is crucial for long-range tasks. RoPE-based position interpolation (PI) methods like linear and frequency-aware scaling extend input lengths without retraining, while post-training quantization (PTQ) enables practical deployment. We show that combining PI with PTQ degrades accuracy due to coupled effects (long-context aliasing, dynamic range dilation, axis-grid anisotropy, and outlier shifting) that induce position-dependent logit noise. We provide the first systematic analysis of PI plus PTQ and introduce two diagnostics: Interpolation Pressure (per-band phase-scaling sensitivity) and Tail Inflation Ratio (outlier shift from short to long contexts). To address this, we propose Q-ROAR, a RoPE-aware, weight-only stabilization that groups RoPE dimensions into a few frequency bands and performs a small search over per-band scales for W_Q, W_K, with an optional symmetric variant to preserve logit scale. The diagnostics-guided search uses a tiny long-context dev set and requires no fine-tuning and no kernel or architecture changes. Empirically, Q-ROAR recovers up to 0.7% accuracy on standard tasks and reduces GovReport perplexity by more than 10%, while preserving short-context performance and compatibility with existing inference stacks.
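The Tail Inflation Ratio diagnostic admits a simple sketch. The abstract describes it only as "outlier shift from short to long contexts"; one plausible reading, shown below, compares a high tail quantile of activation magnitudes collected at long versus short context lengths. The quantile level and the per-tensor granularity are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def tail_inflation_ratio(acts_long, acts_short, q=0.999):
    # Hypothetical reading of the Tail Inflation Ratio: how much the
    # activation outlier tail (here, the q-th quantile of magnitudes)
    # inflates when moving from short to long contexts. A ratio well
    # above 1 flags bands whose quantization range is set by short-context
    # calibration but blown out at interpolated long positions.
    return float(np.quantile(np.abs(acts_long), q)
                 / np.quantile(np.abs(acts_short), q))
```

A per-band version of this ratio would then steer which bands receive the largest rescaling during the search.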
Problem

Research questions and friction points this paper is trying to address.

Combining position interpolation with quantization degrades LLM accuracy
Addresses long-context aliasing and outlier shifting in quantized models
Solves position-dependent logit noise in extended context windows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Per-band frequency scaling for RoPE dimensions
Diagnostics-guided search without fine-tuning
Weight-only stabilization preserving logit scale
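The diagnostics-guided search listed above is small enough that an exhaustive sweep is plausible. The sketch below is a minimal version under assumed interfaces: `eval_ppl` is a hypothetical user-supplied callable that applies a candidate set of per-band scales, quantizes, and returns perplexity on the small long-context dev set; the grid values and exhaustive product search are illustrative, not the paper's exact procedure.

```python
import itertools

def search_band_scales(eval_ppl, n_bands=4,
                       grid=(0.5, 0.707, 1.0, 1.414, 2.0)):
    # Exhaustive search over per-band scale combinations (grid**n_bands
    # candidates; 625 for the defaults). eval_ppl(scales) is assumed to
    # rescale W_Q/W_K, re-quantize, and score a tiny long-context dev set.
    best_ppl, best_scales = float("inf"), None
    for scales in itertools.product(grid, repeat=n_bands):
        ppl = eval_ppl(list(scales))
        if ppl < best_ppl:
            best_ppl, best_scales = ppl, list(scales)
    return best_scales
```

With only a handful of bands, this stays cheap enough to need no fine-tuning or gradient machinery, which is the point of the weight-only design.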