🤖 AI Summary
Coupling RoPE position interpolation (PI) with post-training quantization (PTQ) severely degrades long-context accuracy through aliasing, dynamic-range inflation, anisotropic interactions between axis-aligned quantizers and RoPE rotations, and position-dependent logit noise. Method: We systematically characterize this coupling mechanism and propose two diagnostic metrics, "interpolation pressure" and "tail-inflation ratio." Guided by these, we design Q-ROAR, a lightweight interpolation-aware rescaling method: it groups RoPE dimensions into frequency bands and applies per-band, optionally symmetric rescaling to the Key/Query weights, with the scale configuration found by a small search over a long-context development set. Contribution/Results: Q-ROAR requires no fine-tuning, adds no deployment overhead, and integrates seamlessly into existing inference stacks. On long-context tasks it reduces perplexity by more than 14% while preserving short-context accuracy and inference throughput.
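The tail-inflation diagnostic can be operationalized as a simple quantile ratio over activation magnitudes at short vs. long context. The sketch below is an illustrative assumption (the function name, quantile level, and exact formula are not from the paper): values well above 1 indicate outliers growing under position interpolation, which widens quantizer ranges and inflates quantization error.

```python
import numpy as np

def tail_inflation_ratio(short_acts, long_acts, q=0.999):
    """Illustrative tail-inflation diagnostic (the paper's exact
    definition may differ): ratio of a high quantile of activation
    magnitudes collected at long context vs. short context."""
    short_tail = np.quantile(np.abs(short_acts), q)
    long_tail = np.quantile(np.abs(long_acts), q)
    return long_tail / short_tail
```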
📝 Abstract
Extending the context window of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer inputs without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy between axis-aligned quantizers and rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of PI combined with PTQ and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Guided by this analysis, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for the Key and Query weights (with an optional symmetric variant that preserves logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning of the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces perplexity on long-context workloads by more than 14% while preserving short-context performance, inference throughput, and compatibility with existing LLM serving stacks.
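The symmetric per-band rescaling described above can be sketched as follows; the contiguous banding scheme, array shapes, and function names are assumptions for illustration, not the paper's implementation. The key point is that scaling both dimensions of a RoPE pair by the same factor commutes with the rotation, so scaling Query rows by s_b and Key rows by 1/s_b leaves full-precision attention logits unchanged while reshaping each weight's quantization range.

```python
import numpy as np

def band_indices(head_dim, num_bands):
    """Group the head_dim // 2 RoPE dimension pairs into contiguous
    frequency bands (an illustrative grouping: RoPE rotates pair i at
    frequency base**(-2i/head_dim), so contiguous pairs share scale)."""
    pairs = head_dim // 2
    edges = np.linspace(0, pairs, num_bands + 1).astype(int)
    return [np.arange(edges[b], edges[b + 1]) for b in range(num_bands)]

def apply_symmetric_rescale(W_q, W_k, scales, head_dim):
    """Scale Query rows by s_b and Key rows by 1/s_b per band.

    W_q, W_k have shape (head_dim, hidden), so row d produces output
    dimension d. Attention logits are sums of q_d * k_d over matched
    dimensions, so the s / (1/s) pairing cancels in full precision;
    scaling both coordinates of each RoPE pair identically commutes
    with the rotation, so the invariance survives RoPE as well.
    """
    W_q, W_k = W_q.copy(), W_k.copy()
    for band, s in zip(band_indices(head_dim, len(scales)), scales):
        rows = np.concatenate([2 * band, 2 * band + 1])  # both dims of each pair
        W_q[rows, :] *= s
        W_k[rows, :] /= s
    return W_q, W_k
```

In Q-ROAR the per-band scales would be chosen by the diagnostic-guided search on a small long-context development set; here they are simply passed in.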