From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low training efficiency and poor generalization of large language models (LLMs) on ultra-long contexts (up to 4M tokens), this work proposes a synergistic framework combining phased efficient continued pretraining and multi-granularity instruction tuning. Methodologically, it integrates dynamic position interpolation, synthetic long-sequence data generation, progressive context scaling, and efficient continued pretraining—enabling, for the first time, stable context extension of the Llama3.1-Instruct base model from 128K to 4M tokens. Experiments demonstrate that UltraLong-8B achieves state-of-the-art performance on long-context benchmarks including NarrativeQA, Passkey, and StreamingQA, while preserving competitive accuracy on short-context tasks (e.g., MMLU, BBH) with no significant degradation. All model weights are publicly released. This work establishes a reproducible, scalable training paradigm for ultra-long-context modeling.
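The "dynamic position interpolation" mentioned above is not spelled out in this summary; a common form of RoPE-based context extension is linear position interpolation, where positions are divided by a scale factor so a longer window fits into the rotation range seen during pretraining. A minimal sketch, assuming linear interpolation and an illustrative 32x scale for 128K → 4M (the paper's exact scaling scheme may differ):

```python
import math

def rope_inv_freq(dim, base=10000.0, scale=1.0):
    # Inverse frequencies for rotary position embeddings (RoPE).
    # scale > 1 applies linear position interpolation: every position is
    # effectively divided by `scale`, squeezing a longer context window
    # into the rotation range the model saw during pretraining.
    return [1.0 / (scale * base ** (2 * i / dim)) for i in range(dim // 2)]

def rope_angles(pos, inv_freq):
    # Rotation angles for the token at absolute position `pos`.
    return [pos * f for f in inv_freq]

# With a 32x scale (128K -> 4M), position 4_194_304 rotates exactly as
# position 131_072 did in the unscaled model.
scaled = rope_angles(4_194_304, rope_inv_freq(8, scale=32.0))
original = rope_angles(131_072, rope_inv_freq(8))
```

Under this scheme the model never sees rotation angles outside its pretraining range, which is what makes the extension stable without retraining from scratch.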

📝 Abstract
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from aligned instruct models, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.
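Passkey-style benchmarks referenced in the summary above embed a short secret at a random depth inside long filler text and ask the model to retrieve it. A minimal construction of such a prompt might look like the following (the filler sentence, prompt wording, and helper name are illustrative assumptions, not the paper's exact setup):

```python
import random

def make_passkey_prompt(num_filler, passkey, seed=0):
    # Needle-in-a-haystack style sample: a short "passkey" sentence is
    # hidden at a random depth inside repetitive filler text, and the
    # prompt ends with a question asking the model to recall it.
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. "
    lines = [filler] * num_filler
    needle = f"The pass key is {passkey}. Remember it. "
    lines.insert(rng.randrange(len(lines) + 1), needle)
    return "".join(lines) + "What is the pass key?"

prompt = make_passkey_prompt(1000, 42917)
```

Scaling `num_filler` lets the same template probe retrieval at any target context length, which is how such tests exercise windows from 128K up to millions of tokens.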
Problem

Research questions and friction points this paper is trying to address.

Efficiently training ultra-long context LLMs
Extending context lengths from 128K to 4M tokens
Maintaining performance on both long and short tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient continued pretraining for context extension
Instruction tuning that preserves instruction-following and reasoning ability
Scalable framework for ultra-long context models