MoBA: Mixture of Block Attention for Long-Context LLMs

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing long-context large language models suffer from the quadratic complexity (O(n²)) of full attention, while conventional sparse or linear attention methods impose strong structural biases that can compromise complex reasoning performance. To address this, the paper proposes Mixture of Block Attention (MoBA), which applies Mixture-of-Experts (MoE) principles to the attention mechanism: it partitions the context into blocks, employs dynamic gating to route each query to the most relevant blocks, and supports plug-and-play sparsification, enabling seamless switching between full and sparse attention without predefined sparsity patterns. Crucially, MoBA avoids fixed structural priors, allowing the model to learn where to attend autonomously. Deployed in Kimi's production environment, MoBA serves long-context requests with significantly improved computational efficiency and no reported loss in performance, offering an efficient and flexible attention paradigm for long-text understanding and multi-step reasoning.
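The block-and-gate routing described above can be illustrated with a minimal sketch. This is not Moonshot AI's implementation: the mean-pooled block keys, the block size, and the top-k values below are assumptions chosen for clarity, and production details such as causal masking and batching are omitted.

```python
import numpy as np

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Hedged sketch of block-sparse attention with MoE-style routing.

    Each query scores the context blocks via a dot product with the
    block's mean-pooled key (an assumed gating function), selects its
    top-k blocks, and computes softmax attention only over tokens in
    those blocks.
    """
    T, d = q.shape
    assert T % block_size == 0, "sketch assumes length divisible by block_size"
    n_blocks = T // block_size

    # Gate: affinity between each query and each block's mean key.
    block_keys = k.reshape(n_blocks, block_size, d).mean(axis=1)  # (n_blocks, d)
    gate = q @ block_keys.T                                       # (T, n_blocks)
    topk = np.argsort(gate, axis=-1)[:, -top_k:]                  # (T, top_k)

    # Expand the selected blocks into a token-level attention mask.
    block_mask = np.zeros((T, n_blocks), dtype=bool)
    np.put_along_axis(block_mask, topk, True, axis=1)
    token_mask = np.repeat(block_mask, block_size, axis=1)        # (T, T)

    # Standard scaled-dot-product attention, restricted to the mask.
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(token_mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Setting `top_k` equal to the number of blocks selects every block, recovering dense attention; this is one way to picture the seamless full/sparse transition the summary refers to.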

📝 Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
Problem

Research questions and friction points this paper is trying to address.

Scaling context length for LLMs
Reducing quadratic computational complexity
Enhancing efficiency without performance loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Block Attention
Seamless transition capability
Efficient attention computation
👥 Authors
Enzhe Lu
Moonshot AI
Zhejun Jiang
Moonshot AI
Jingyuan Liu
Moonshot AI
Yulun Du
Carnegie Mellon University
Deep Learning · Natural Language Processing · Human-AI Interaction
Tao Jiang
Moonshot AI
Chao Hong
Moonshot AI
Shaowei Liu
University of Illinois Urbana-Champaign
Computer Vision · Robotics
Weiran He
Unknown affiliation
Enming Yuan
Moonshot AI
Yuzhi Wang
Research Engineer @ Megvii Inc.
Computer Vision · Artificial Intelligence · Wireless Sensor Network
Zhiqi Huang
Moonshot AI
Huan Yuan
Unknown affiliation
Suting Xu
Moonshot AI
Xinran Xu
Moonshot AI
Guokun Lai
Inflection AI
Machine Learning
Yanru Chen
Moonshot AI
Huabin Zheng
Moonshot AI
Junjie Yan
Moonshot AI
Jianlin Su
Moonshot AI
Yuxin Wu
Moonshot AI
Neo Y. Zhang
Moonshot AI
Zhilin Yang
Carnegie Mellon University
Deep Learning · Machine Learning · Natural Language Processing
Xinyu Zhou
Moonshot AI
Mingxing Zhang
Tsinghua University
Jiezhong Qiu
Zhejiang University - Zhejiang Lab Hundred Talents Program Researcher
Data Mining · Social Network Analysis · Network Embedding · Graph Neural Networks