Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

πŸ“… 2025-11-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Addressing the challenge of jointly achieving sparsity, flexible random access, and strong length generalization in ultra-long-context modeling, this paper introduces Hierarchical Sparse Attention (HSA), the first mechanism enabling efficient training and inference on 16-million-token contexts within an 8B-parameter Mixture-of-Experts (MoE) architecture. HSA employs a multi-granularity sparse attention pattern, preserving full Transformer compatibility while substantially improving length extrapolation. Trained on over 8 trillion tokens, the model is evaluated at both in-domain and out-of-domain context lengths: it matches dense-attention baselines on in-domain tasks and achieves over 90% accuracy on most tasks in a 16M-token retrieval benchmark. This work establishes a new paradigm for scalable "memory machines," balancing computational efficiency, access flexibility, and robust generalization to arbitrary context lengths.

πŸ“ Abstract
This work explores the challenge of building "Machines that Can Remember," framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens and rigorously evaluated on tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines at in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M tokens. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
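The paper does not spell out HSA's exact formulation here, but the general idea of hierarchical (multi-granularity) sparse attention can be sketched as a two-level procedure: each query first scores coarse chunk summaries, keeps only the top-scoring chunks, and then attends densely over the tokens inside those chunks. The sketch below is illustrative only; the function name, pooling choice, and hyperparameters are assumptions, not the paper's implementation.

```python
# Illustrative two-level sparse attention sketch (NOT the paper's exact HSA).
# Coarse level: mean-pooled key per chunk acts as its summary; the query keeps
# the top-k chunks (sparsity + random access). Fine level: ordinary softmax
# attention restricted to tokens in the selected chunks.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_sparse_attention(q, K, V, chunk_size=4, top_k=2):
    """q: (d,) single query; K, V: (n, d); n must be divisible by chunk_size."""
    n, d = K.shape
    num_chunks = n // chunk_size
    # Coarse scoring against one summary vector per chunk.
    chunk_keys = K.reshape(num_chunks, chunk_size, d).mean(axis=1)
    chunk_scores = chunk_keys @ q
    # Keep only the top_k highest-scoring chunks, in positional order.
    selected = np.sort(np.argsort(chunk_scores)[-top_k:])
    # Dense attention over the tokens of the selected chunks only:
    # cost scales with top_k * chunk_size, not with n.
    idx = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size)
                          for c in selected])
    weights = softmax(K[idx] @ q / np.sqrt(d))
    return weights @ V[idx]
```

Because per-query cost depends on `top_k * chunk_size` rather than the full context length `n`, this pattern is one way to see how the three properties interact: selection gives sparsity, chunk indexing gives random access, and length-independent per-step cost is a precondition for length generalization.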
Problem

Research questions and friction points this paper is trying to address.

Generalizing ultra-long context modeling in LLMs
Achieving sparsity, random-access flexibility, length generalization
Handling contexts up to 16M tokens efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Sparse Attention mechanism for efficiency
8B-parameter MoE model trained on over 8 trillion tokens
Handles 16M context length with high accuracy
πŸ”Ž Similar Papers
No similar papers found.