Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

πŸ“… 2026-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the prohibitive GPU memory cost of training large language models on ultra-long contexts, where activation memory scales linearly with sequence length. The authors propose a chunk-recurrent training framework that integrates on-the-fly activation recomputation, paged memory management, asynchronous CPU offloading, and page-level sparse attention. This approach decouples activation memory from context length, achieving an overhead of only 10 MB per 10,000 tokens of context and enabling million-token-scale training. The method drastically reduces hardware requirements, successfully training the Qwen2.5-7B model with a 4-million-token context on a single H200 GPUβ€”far exceeding the scale achievable by existing methods without large-scale clusters.
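The scaling figure quoted above can be sanity-checked with simple arithmetic: at ~10 MB of extra training memory per 10K tokens, the context-dependent overhead of the reported 4M-token run works out to roughly 4 GB. This is pure arithmetic on the reported numbers, not a measurement:

```python
# Back-of-envelope check of the reported memory scaling.
# Figures are taken from the summary above; variable names are illustrative.
mb_per_10k_tokens = 10              # reported overhead per 10K tokens
context_tokens = 4_000_000          # the 4M-token Qwen2.5-7B run
overhead_mb = context_tokens / 10_000 * mb_per_10k_tokens
print(overhead_mb)  # 4000.0 -> roughly 4 GB of context-dependent overhead
```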

πŸ“ Abstract
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.
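The core mechanism the abstract describes can be illustrated with a toy sketch: the sequence is processed chunk by chunk so that activations exist only for the current chunk (a constant, O(1) footprint), while keys/values accumulate in fixed-size pages to avoid fragmentation. Everything below (`PagedKVCache`, `process_in_chunks`, `PAGE_SIZE`) is an illustrative assumption, not the paper's actual API, and real tensors, recomputation in the backward pass, offloading, and sparse attention are elided:

```python
PAGE_SIZE = 4  # tokens per KV-cache page (illustrative, not the paper's value)

class PagedKVCache:
    """Append-only KV store allocated in fixed-size pages to avoid fragmentation."""
    def __init__(self):
        self.pages = []  # each page holds up to PAGE_SIZE (token, value) entries

    def append(self, kv_pairs):
        for kv in kv_pairs:
            if not self.pages or len(self.pages[-1]) == PAGE_SIZE:
                self.pages.append([])  # allocate a fresh page on demand
            self.pages[-1].append(kv)

    def __len__(self):
        return sum(len(p) for p in self.pages)

def process_in_chunks(tokens, chunk_size):
    """Process a long sequence chunk by chunk: activations live only for the
    current chunk (constant footprint), while the KV cache grows linearly."""
    cache = PagedKVCache()
    peak_live_activations = 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # "Activations" for this chunk exist only inside this iteration; a real
        # system would recompute them on-the-fly during the backward pass
        # instead of keeping them resident for the whole sequence.
        activations = [t * 2 for t in chunk]  # stand-in for a forward pass
        peak_live_activations = max(peak_live_activations, len(activations))
        cache.append(list(zip(chunk, activations)))
    return cache, peak_live_activations

cache, peak = process_in_chunks(list(range(100)), chunk_size=8)
print(len(cache), peak, len(cache.pages))  # 100 8 25
```

Note how peak live "activation" memory is bounded by the chunk size regardless of total sequence length, which is why the growing KV cache (here, 25 pages for 100 tokens) becomes the remaining bottleneck that the paper's paging, offloading, and sparse-attention techniques target.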
Problem

Research questions and friction points this paper is trying to address.

LLM training
long-context
GPU memory
activation memory
memory efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-efficient training
long-context LLMs
activation recomputation
paged KV cache
sparse attention
Wenhao Li
Xiamen University
LMSys, Efficient LLM

Daohai Yu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

Gen Luo
Shanghai AI Laboratory
computer vision, vision and language

Yuxin Zhang
Xiamen University
Network sparsity, Model compression

Fei Chao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

Yifan Wu
Peking University
AIOps, Log Analysis, Software Engineering

Jiaxin Liu
University of Illinois Urbana-Champaign

Ziyang Gong
SJTU, THU, Shanghai AI Lab (OpenGVLab), SYSU
Embodied Spatial Intelligence

Zimu Liao
Shanghai AI Lab & SJTU
High Performance Computing, Computer Graphics, 3D Vision, Parallel Programming