Short Data, Long Context: Distilling Positional Knowledge in Transformers

πŸ“… 2026-04-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Extending the context window of language models typically relies on costly long-context pretraining, which suffers from low training efficiency and data scarcity. This work proposes a logit-based knowledge distillation approach that enables a student model trained exclusively on packed short-context samples to inherit the teacher model’s long-context retrieval capability. Key findings include the effective transfer of positional information through logit distillation, the emergence of structured update patterns in query states during long-context inference, and a significant improvement in extrapolation performance from phase-wise RoPE scaling. Experimental results demonstrate that the method substantially enhances the student model’s long-context comprehension without requiring any long-sequence training data.
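To make the RoPE-scaling idea concrete, the sketch below shows how a scaling factor stretches the rotation angles that RoPE applies to query/key pairs, so that longer positions stay within the angle range seen during training. This is a minimal, linear position-interpolation-style illustration only; the paper's phase-wise scaling schedule is not reproduced here, and `base` and `scale` are illustrative hyperparameters.

```python
import math

def rope_frequencies(head_dim, base=10000.0, scale=1.0):
    """Inverse frequencies for RoPE on a head of dimension `head_dim`.
    `scale` > 1 compresses positions (position interpolation), an
    assumption standing in for the paper's phase-wise schedule."""
    return [
        (1.0 / (base ** (2 * i / head_dim))) / scale
        for i in range(head_dim // 2)
    ]

def rotation_angles(position, freqs):
    """Angles applied to each query/key pair at a given position;
    scaling the frequencies shrinks these angles proportionally."""
    return [position * f for f in freqs]

# With scale=2, position 100 is rotated like position 50 was at scale=1,
# keeping long-context positions inside the trained rotational spectrum.
angles_base = rotation_angles(100, rope_frequencies(8, scale=1.0))
angles_scaled = rotation_angles(100, rope_frequencies(8, scale=2.0))
```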
πŸ“ Abstract
Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher's output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
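The distillation signal described above is the standard logit-based objective: the student matches the teacher's output distribution at each position, which is how positional perturbations in the teacher's logits can propagate to the student. A minimal sketch, assuming the usual temperature-softened KL formulation (the temperature value is illustrative, not from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 as in standard logit distillation. Any positional
    information encoded in the teacher's logits enters the student's
    gradient through this term."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2
```

When teacher and student agree exactly, the loss is zero; any shift in the teacher's distribution (e.g. from positional perturbations) produces a nonzero gradient signal for the student.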
Problem

Research questions and friction points this paper is trying to address.

long-context
knowledge distillation
positional information
transformer
RoPE
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation
long-context modeling
Rotary Position Embedding
positional information transfer
short-context training