Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

📅 2025-12-12
🤖 AI Summary
Long-context extension has typically relied on computationally expensive fine-tuning. Method: this paper proposes DroPE, a post-pretraining method that removes positional encodings entirely and achieves zero-shot context-length extrapolation after a lightweight short-sequence calibration phase. Contribution/Results: the authors demonstrate, theoretically and empirically, that positional encodings are not strictly necessary for language modeling; rather, they play a dual role, providing a beneficial inductive bias that speeds pretraining convergence while impeding generalization to longer sequences. DroPE combines a dynamic dropping mechanism with parameter-efficient calibration to extend context length zero-shot without degrading short-context performance. Experiments across multiple model architectures and data scales show DroPE substantially outperforming RoPE-scaling methods and specialized long-context models, breaking the fine-tuning dependency of context extension.

📝 Abstract
So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LMs). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.
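To make concrete what "dropping" rotary positional embeddings means, here is a minimal NumPy sketch (not the paper's implementation; `rope_rotate`, `attention`, and all parameters are illustrative assumptions based on the standard RoPE formulation). It shows single-query attention with RoPE toggled on or off: with the rotation removed, the attention output no longer depends on where each key/value pair sits in the sequence, which is the position-dependence that ties a model to its training length.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply standard rotary position embedding to x of shape (seq, d), d even."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,) frequencies
    angles = positions[:, None] * inv_freq[None, :]        # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                        # paired dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                     # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def attention(q, k, v, use_rope=True):
    """Softmax attention for one query q: (1, d) over keys/values k, v: (seq, d)."""
    seq, d = k.shape
    if use_rope:
        q = rope_rotate(q, np.array([seq - 1]))  # query taken to sit at the last position
        k = rope_rotate(k, np.arange(seq))       # keys rotated by their positions
    scores = q @ k.T / np.sqrt(d)                # (1, seq) scaled dot products
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # softmax weights
    return w @ v                                 # (1, d) attended output
```

Without RoPE, the output is invariant to permuting the key/value pairs, so nothing in the attention map is tied to absolute position. In real decoder-only LMs, causal masking still supplies implicit order information, which is one reason position-free attention can remain a viable language model after recalibration.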
Problem

Research questions and friction points this paper is trying to address.

Extends LLM context without costly finetuning
Removes positional embeddings after training for generalization
Enables zero-shot adaptation to longer sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dropping positional embeddings after pretraining
Enables zero-shot context extension without finetuning
Quickly adapts models without compromising original capabilities