LongRoPE2: Near-Lossless LLM Context Window Scaling

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant degradation in short-context performance observed when extending the context window of large language models (LLMs), and proposes an efficient, near-lossless context extension method. It identifies a novel failure mechanism: insufficient training of the high RoPE (Rotary Position Embedding) dimensions leads to out-of-distribution (OOD) generalization failure at longer lengths. To rectify this, the authors design a "needle-driven", perplexity-guided evolutionary search algorithm that automatically optimizes the RoPE rescaling parameters, and introduce a joint fine-tuning paradigm using mixed-length contexts. Applied to LLaMA3-8B, the method extends the context window to 128K tokens while preserving over 98.5% of short-context task performance, requiring only 10B training tokens (80× fewer than Meta's approach). The contributions are: (1) uncovering a previously unrecognized RoPE-related OOD failure mode; (2) an automated optimization framework for positional-encoding calibration; and (3) an effective mixed-length context training strategy.
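The rescaling mechanism the summary describes can be sketched as follows: each RoPE dimension rotates at a fixed frequency, and dividing selected frequencies by per-dimension factors stretches their rotation period so that positions beyond the original training length stay in-distribution. This is a minimal illustrative sketch; the specific factor values below are placeholders, not the parameters LongRoPE2 actually searches for.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, rescale=None):
    """Per-dimension RoPE rotation angles for the given positions.

    rescale: optional per-dimension factors (a hypothetical stand-in
    for the searched LongRoPE2 parameters) that divide each frequency,
    slowing the rotation of the affected dimensions.
    """
    # Standard RoPE frequency schedule: higher dims rotate more slowly.
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # shape (dim // 2,)
    if rescale is not None:
        freqs = freqs / np.asarray(rescale)          # stretch rotation periods
    return np.outer(positions, freqs)                # (num_positions, dim // 2)

# Stretch the highest (slowest) dimensions the most, mirroring the
# hypothesis that the insufficiently trained high dimensions are the
# source of OOD failure. These factors are illustrative only.
dim = 8
factors = np.linspace(1.0, 32.0, dim // 2)
angles = rope_angles(np.arange(4096), dim, rescale=factors)
```

With larger factors on the high dimensions, positions at extended lengths produce rotation angles that remain inside the range the model saw during pre-training.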

📝 Abstract
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
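The mixed context window training the abstract describes can be illustrated as a batch scheduler that routes short samples through the original RoPE and long samples through the rescaled RoPE, so the model adapts to long contexts without drifting on short ones. The lengths and the 50/50 ratio below are assumptions for illustration, not the paper's exact recipe.

```python
import random

def mixed_length_batches(docs, short_len=8192, long_len=131072,
                         short_ratio=0.5, seed=0):
    """Toy scheduler for mixed context window fine-tuning.

    Yields (sequence, rope_mode) pairs: short samples are tagged for
    the original RoPE, long samples for the rescaled RoPE. All
    hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    for doc in docs:
        if rng.random() < short_ratio:
            yield doc[:short_len], "original_rope"
        else:
            yield doc[:long_len], "rescaled_rope"
```

In an actual fine-tuning loop, the tag would select which positional-embedding configuration the forward pass uses for that sample.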
Problem

Research questions and friction points this paper is trying to address.

Extends LLMs' context window effectively
Preserves performance on original short context
Reduces training tokens for context scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary search-based RoPE rescaling algorithm
Mixed context window training approach
Hypothesis on insufficient RoPE dimension training
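The evolutionary search highlighted above can be sketched as a simple (μ+λ) loop over candidate per-dimension rescale factors, keeping the best half of each generation and mutating it. The `fitness` callable is a caller-supplied stand-in for the paper's "needle-driven" perplexity (lower is better), which scores candidates on documents with planted "needle" facts so the signal reflects long-range retrieval rather than local fluency; the population size, mutation range, and factor bounds are assumptions.

```python
import random

def evolutionary_search(dim_half, fitness, pop_size=16, generations=20,
                        seed=0):
    """Toy evolutionary search over per-dimension RoPE rescale factors.

    fitness: callable mapping a factor list to a score (lower is
    better); a hypothetical proxy for needle-driven perplexity.
    """
    rng = random.Random(seed)
    # Random initial population of rescale-factor vectors.
    pop = [[rng.uniform(1.0, 64.0) for _ in range(dim_half)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                 # best candidates first
        survivors = pop[: pop_size // 2]      # keep the top half
        children = [[max(1.0, f * rng.uniform(0.8, 1.25)) for f in parent]
                    for parent in survivors]  # multiplicative mutation
        pop = survivors + children            # (mu + lambda) selection
    return min(pop, key=fitness)

# Example with a synthetic fitness: squared distance to a fictitious
# optimum, used here only to exercise the search loop.
target = [2.0 * (i + 1) for i in range(4)]
best = evolutionary_search(4, lambda f: sum((a - b) ** 2
                                            for a, b in zip(f, target)))
```

Because survivors are retained across generations, the best fitness in the population is monotonically non-increasing, so the search can only improve on its random initialization.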