🤖 AI Summary
This work addresses the sensitivity of large language models to the absolute position of critical information in long-context scenarios, which leads to high positional variance and brittle reasoning. To mitigate this issue, the authors propose RoPE-Perturbed Self-Distillation, a training regularizer that combines perturbation of Rotary Position Embedding (RoPE) indices with self-distillation. By generating multiple views of the same input sequence and enforcing consistent predictions across them, the method reduces reliance on absolute positional cues in favor of semantic signals. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B show gains of up to 12.04% on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, along with improved length extrapolation beyond the training context window.
📝 Abstract
Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty.
We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
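The core idea can be illustrated with a minimal, framework-free Python sketch. It is not the authors' implementation: the abstract does not specify the perturbation scheme or the loss, so the single random shift in `perturb_positions`, the choice of the unperturbed view as teacher, and the per-token KL objective in `consistency_loss` are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random


def perturb_positions(seq_len, max_shift, rng):
    """Hypothetical perturbation: pick a random split point and shift every
    position index from that point onward, so the tail of the context
    appears to sit further away. Token order is preserved."""
    split = rng.randrange(seq_len)
    shift = rng.randint(0, max_shift)
    return [p if p < split else p + shift for p in range(seq_len)]


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's output distribution."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def consistency_loss(teacher_view, student_view):
    """Mean per-token KL between two views of the same sequence.
    Assumption: the unperturbed view is the (fixed) teacher and the
    perturbed view is the student."""
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_view, student_view)]
    return sum(per_token) / len(per_token)


rng = random.Random(0)
positions = perturb_positions(seq_len=8, max_shift=4, rng=rng)

# Toy logits standing in for model outputs on the two views
# (2 tokens, vocabulary of 3).
original_view = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
perturbed_view = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]
loss = consistency_loss(original_view, perturbed_view)
```

In an actual training run, the perturbed indices would replace the default positions fed to the model's RoPE computation, and a term like this would presumably be added to the usual language-modeling loss with a weighting coefficient. Both details are assumptions here rather than something the abstract states.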