🤖 AI Summary
To address the low training efficiency and poor generalization of large language models (LLMs) on ultra-long contexts (up to 4M tokens), this work proposes a synergistic framework that combines phased, efficient continued pretraining with multi-granularity instruction tuning. Methodologically, it integrates dynamic position interpolation, synthetic long-sequence data generation, progressive context scaling, and efficient continued pretraining, which together enable, for the first time, stable extension of the Llama3.1-Instruct base model's context window from 128K to 4M tokens. Experiments show that UltraLong-8B achieves state-of-the-art performance on long-context benchmarks including NarrativeQA, Passkey, and StreamingQA, while preserving competitive accuracy on short-context tasks (e.g., MMLU, BBH) with no significant degradation. All model weights are publicly released, establishing a reproducible, scalable training paradigm for ultra-long-context modeling.
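The position-interpolation idea behind such context-extension recipes can be sketched as follows. This is a minimal illustration of *linear* RoPE position interpolation, a common basis for extending a context window before continued pretraining; the paper's actual "dynamic" variant and hyperparameters may differ, and the `rope_angles` helper below is hypothetical, not from the released code.

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """Rotary position embedding (RoPE) angles for the given positions.

    scale < 1 implements linear position interpolation: positions are
    compressed so that a context longer than the original training
    window maps back into the position range the model was trained on.
    """
    # Standard RoPE inverse frequencies for each pair of dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Compress positions by `scale`, then form the (pos, freq) angle grid.
    return np.outer(positions * scale, inv_freq)  # shape: (len(positions), dim // 2)

# Illustrative numbers: extend a 128K-trained model to a 4M-token window
# by compressing positions with scale = 128K / 4M.
orig_ctx, new_ctx = 128_000, 4_000_000
scale = orig_ctx / new_ctx

positions = np.array([0, new_ctx - 1])
angles = rope_angles(positions, scale=scale)

# The largest interpolated position stays within the trained range,
# so the model never sees rotation angles beyond what it was trained on.
assert (positions * scale).max() <= orig_ctx
```

In practice this interpolation is paired with continued pretraining on long sequences (as the recipe above does) so the model adapts to the compressed position resolution rather than relying on interpolation alone.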
📝 Abstract
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long-context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long- and short-context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.