Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

πŸ“… 2025-11-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Addressing the challenge of jointly achieving sparsity, flexible random access, and strong length generalization in ultra-long-context modeling, this paper introduces Hierarchical Sparse Attention (HSA), the first mechanism enabling efficient training and inference on 16-million-token contexts within an 8B-parameter Mixture-of-Experts (MoE) architecture. HSA employs a multi-granularity sparse attention pattern, preserving full Transformer compatibility while substantially improving length extrapolation. Trained on over 8 trillion tokens, the model is evaluated at both in-domain and out-of-domain context lengths: it matches dense-attention baselines on in-domain tasks and achieves over 90% accuracy on most tasks in a 16M-token retrieval benchmark. This work establishes a new paradigm for scalable "memory machines," balancing computational efficiency, access flexibility, and robust generalization to arbitrary context lengths.

πŸ“ Abstract
This work explores the challenge of building "Machines that Can Remember," framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens and rigorously evaluated on tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines at in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M tokens. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
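The paper does not spell out HSA's exact formulation here, but the general idea of hierarchical (multi-granularity) sparse attention can be sketched as a two-level procedure: each query first scores coarse chunk summaries, keeps only the top-scoring chunks, and then attends densely over the tokens inside those chunks. The sketch below is illustrative only; the function name, pooling choice, and hyperparameters are assumptions, not the paper's implementation.

```python
# Illustrative two-level sparse attention sketch (NOT the paper's exact HSA).
# Coarse level: mean-pooled key per chunk acts as its summary; the query keeps
# the top-k chunks (sparsity + random access). Fine level: ordinary softmax
# attention restricted to tokens in the selected chunks.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_sparse_attention(q, K, V, chunk_size=4, top_k=2):
    """q: (d,) single query; K, V: (n, d); n must be divisible by chunk_size."""
    n, d = K.shape
    num_chunks = n // chunk_size
    # Coarse scoring against one summary vector per chunk.
    chunk_keys = K.reshape(num_chunks, chunk_size, d).mean(axis=1)
    chunk_scores = chunk_keys @ q
    # Keep only the top_k highest-scoring chunks, in positional order.
    selected = np.sort(np.argsort(chunk_scores)[-top_k:])
    # Dense attention over the tokens of the selected chunks only:
    # cost scales with top_k * chunk_size, not with n.
    idx = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size)
                          for c in selected])
    weights = softmax(K[idx] @ q / np.sqrt(d))
    return weights @ V[idx]
```

Because per-query cost depends on `top_k * chunk_size` rather than the full context length `n`, this pattern is one way to see how the three properties interact: selection gives sparsity, chunk indexing gives random access, and length-independent per-step cost is a precondition for length generalization.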
Problem

Research questions and friction points this paper is trying to address.

Generalizing ultra-long context modeling in LLMs
Achieving sparsity, random-access flexibility, length generalization
Handling contexts up to 16M tokens efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Sparse Attention mechanism for efficiency
8B-parameter MoE model trained on over 8 trillion tokens
Handles 16M context length with high accuracy
πŸ”Ž Similar Papers
No similar papers found.