Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from limited long-context reasoning capability, both because of the quadratic computational complexity of attention and because human-annotated long-text data is scarce and expensive. Method: The paper proposes a hierarchical synthetic data generation framework that decouples training context length from the length of available real-world data, combining progressive RoPE scaling with long-context instruction tuning to reach million-token contexts. Contribution/Results: It introduces the first open-source instruction-tuning dataset with contexts of up to 1M tokens. Evaluated on RULER and InfiniteBench, the approach achieves significant gains in long-context understanding while preserving performance on standard general-purpose benchmarks, establishing a systematic, open methodology for advancing LLMs' long-context capabilities.
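
Below is a minimal sketch of what hierarchical synthetic data generation for long-context instruction tuning could look like: raw chunks of a long document are summarized level by level, and an instruction/answer pair is derived from the top-level summary so that answering it requires the full context. The helper names (`summarize`, `ask_question`), chunk size, and hierarchy depth are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of hierarchical synthetic data generation for long-context instruction tuning.
# Helper names, chunk size, and hierarchy depth are illustrative assumptions.
from typing import Callable, Dict, List

def chunk(text: str, chunk_size: int = 4_000) -> List[str]:
    """Split a long document into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_hierarchy(doc: str, summarize: Callable[[str], str], levels: int = 2) -> List[List[str]]:
    """Level 0 holds raw chunks; each higher level summarizes pairs of units below it."""
    hierarchy = [chunk(doc)]
    for _ in range(levels):
        prev = hierarchy[-1]
        grouped = [" ".join(prev[i:i + 2]) for i in range(0, len(prev), 2)]
        hierarchy.append([summarize(g) for g in grouped])
    return hierarchy

def make_instruction_example(doc: str,
                             summarize: Callable[[str], str],
                             ask_question: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Create one instruction example whose answer depends on the whole document."""
    hierarchy = build_hierarchy(doc, summarize)
    global_summary = " ".join(hierarchy[-1])
    qa = ask_question(global_summary)  # e.g., an LLM call returning {"question", "answer"}
    return {
        "context": doc,                 # the full long document is kept as context
        "instruction": qa["question"],  # question grounded in the global summary
        "response": qa["answer"],
    }
```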

📝 Abstract
Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.
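
As a rough illustration of the step-by-step RoPE scaling training strategy mentioned in the abstract, the sketch below raises the RoPE base (theta) and the maximum position in stages before continuing training on longer synthetic data. The stage schedule, theta values, and base checkpoint are assumptions, and the Hugging Face config fields shown apply to Llama-style models, not necessarily the paper's exact recipe.

```python
# Sketch of step-by-step RoPE scaling: extend the context window in stages,
# raising the RoPE base (theta) at each stage, then continue fine-tuning on
# longer synthetic instruction data. Values below are illustrative assumptions.
from transformers import AutoConfig, AutoModelForCausalLM

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical starting checkpoint

# (max_position_embeddings, rope_theta) per stage; longer contexts use a larger base.
STAGES = [(131_072, 5e6), (262_144, 2e7), (524_288, 8e7), (1_048_576, 3.2e8)]

checkpoint = BASE_MODEL
for max_pos, theta in STAGES:
    config = AutoConfig.from_pretrained(checkpoint)
    config.max_position_embeddings = max_pos  # extend the positional range
    config.rope_theta = theta                 # stretch RoPE wavelengths accordingly
    model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config)
    # ... fine-tune `model` on synthetic instruction data at this context length ...
    # checkpoint = path where this stage's weights were saved, used for the next stage
```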
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with long-context reasoning due to computational complexity
Scarcity of annotated long-context data limits model performance
Lack of open-source long-context datasets exceeding 100K tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical synthetic data generation for long-context
Scalable RoPE training strategy for 1M tokens
Efficient post-training without real-world data constraints
👥 Authors
Linda He (Harvard University)
Jue Wang (Together AI)
Maurice Weber (Together AI)
Shang Zhu (Together AI)
Ben Athiwaratkun (Together AI)
Ce Zhang (University of Chicago)