🤖 AI Summary
Coupling RoPE position interpolation (PI) with post-training quantization (PTQ) severely degrades long-context accuracy through aliasing, dynamic-range inflation, anisotropic interactions between axis-aligned quantizers and RoPE rotations, and position-dependent logit noise. Method: We systematically characterize this coupling mechanism and propose two diagnostic metrics, "interpolation pressure" and "tail-inflation ratio." Guided by these, we design Q-ROAR, a lightweight interpolation-aware rescaling method: it groups RoPE dimensions into frequency bands and applies per-band, optionally symmetric rescaling to the Key/Query weights, with the scale configuration found by a small search over a long-context development set. Contribution/Results: Q-ROAR requires no fine-tuning, adds no deployment overhead, and integrates seamlessly into existing inference stacks. On long-context tasks it reduces perplexity by more than 14% while preserving short-context accuracy and inference throughput.
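The tail-inflation diagnostic can be operationalized as a simple quantile ratio over activation magnitudes at short vs. long context. The sketch below is an illustrative assumption (the function name, quantile level, and exact formula are not from the paper): values well above 1 indicate outliers growing under position interpolation, which widens quantizer ranges and inflates quantization error.

```python
import numpy as np

def tail_inflation_ratio(short_acts, long_acts, q=0.999):
    """Illustrative tail-inflation diagnostic (the paper's exact
    definition may differ): ratio of a high quantile of activation
    magnitudes collected at long context vs. short context."""
    short_tail = np.quantile(np.abs(short_acts), q)
    long_tail = np.quantile(np.abs(long_acts), q)
    return long_tail / short_tail
```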
📝 Abstract
Extending the context window of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer inputs without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy between axis-aligned quantizers and rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of PI combined with PTQ and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Guided by this analysis, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for the Key and Query weights (with an optional symmetric variant that preserves logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning of the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces perplexity on long-context workloads by more than 14% while preserving short-context performance, inference throughput, and compatibility with existing LLM serving stacks.
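The symmetric per-band rescaling described above can be sketched as follows; the contiguous banding scheme, array shapes, and function names are assumptions for illustration, not the paper's implementation. The key point is that scaling both dimensions of a RoPE pair by the same factor commutes with the rotation, so scaling Query rows by s_b and Key rows by 1/s_b leaves full-precision attention logits unchanged while reshaping each weight's quantization range.

```python
import numpy as np

def band_indices(head_dim, num_bands):
    """Group the head_dim // 2 RoPE dimension pairs into contiguous
    frequency bands (an illustrative grouping: RoPE rotates pair i at
    frequency base**(-2i/head_dim), so contiguous pairs share scale)."""
    pairs = head_dim // 2
    edges = np.linspace(0, pairs, num_bands + 1).astype(int)
    return [np.arange(edges[b], edges[b + 1]) for b in range(num_bands)]

def apply_symmetric_rescale(W_q, W_k, scales, head_dim):
    """Scale Query rows by s_b and Key rows by 1/s_b per band.

    W_q, W_k have shape (head_dim, hidden), so row d produces output
    dimension d. Attention logits are sums of q_d * k_d over matched
    dimensions, so the s / (1/s) pairing cancels in full precision;
    scaling both coordinates of each RoPE pair identically commutes
    with the rotation, so the invariance survives RoPE as well.
    """
    W_q, W_k = W_q.copy(), W_k.copy()
    for band, s in zip(band_indices(head_dim, len(scales)), scales):
        rows = np.concatenate([2 * band, 2 * band + 1])  # both dims of each pair
        W_q[rows, :] *= s
        W_k[rows, :] /= s
    return W_q, W_k
```

In Q-ROAR the per-band scales would be chosen by the diagnostic-guided search on a small long-context development set; here they are simply passed in.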