Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high I/O and computational overhead of long-context reasoning, this paper introduces the Ring-linear model family. The models adopt a hybrid architecture that combines linear and softmax attention mechanisms, with the training and inference engine operators kept closely aligned. The authors systematically explore the ratio between the two attention mechanisms to identify the currently optimal structure, and leverage Linghe, their self-developed high-performance FP8 operator library, which improves overall training efficiency by 50% and hardware utilization. Compared with a 32B dense baseline, inference cost drops to roughly 1/10; relative to the original Ring series, cost is cut by over 50%, while the models maintain state-of-the-art performance across multiple complex reasoning benchmarks. The design balances efficiency, stability, and scalability, offering a practical path for deploying large language models with extended context windows.

📝 Abstract
In this technical report, we present the Ring-linear model series, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B total parameters with 957M activated parameters, while Ring-flash-linear-2.0 contains 104B total parameters with 6.1B activated parameters. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32-billion-parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is reduced by over 50%. Furthermore, through systematic exploration of the ratio between the different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging Linghe, our self-developed high-performance FP8 operator library, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
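As a rough illustration of the hybrid design described in the abstract, the sketch below interleaves generic linear-attention and softmax-attention layers at a configurable ratio. This is a minimal, assumption-laden PyTorch mock-up, not the authors' implementation: the class names, the kernel feature map used for linearization, and the `softmax_every` ratio are illustrative choices, and residual connections, MLP blocks, causal masking, and the FP8 kernels are omitted for brevity.

```python
import torch
import torch.nn as nn


class SoftmaxAttention(nn.Module):
    """Standard multi-head softmax attention, O(n^2) in sequence length."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class LinearAttention(nn.Module):
    """Generic kernel-based linear attention, O(n) in sequence length.

    Stand-in only: the actual Ring-linear operator is not reproduced here.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq, head_dim).
        q = q.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # Softmax feature maps, then reassociate (q (k^T v)) so cost is linear in n.
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        out = torch.einsum("bhnd,bhde->bhne", q, kv)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.out(out)


def build_hybrid_stack(dim: int = 1024, num_heads: int = 8,
                       depth: int = 12, softmax_every: int = 4) -> nn.Sequential:
    """Interleave linear and softmax attention at a fixed ratio.

    With softmax_every=4, three of every four attention layers are linear,
    the kind of mixture-ratio trade-off the report explores.
    """
    layers = []
    for i in range(depth):
        if (i + 1) % softmax_every == 0:
            layers.append(SoftmaxAttention(dim, num_heads))
        else:
            layers.append(LinearAttention(dim, num_heads))
    return nn.Sequential(*layers)
```

Under these assumptions, most layers scale linearly with context length while the periodic softmax layers retain full pairwise attention, which is the intuition behind the reported reduction in long-context I/O and compute.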
Problem

Research questions and friction points this paper is trying to address.

Designing an efficient hybrid attention architecture for long-context reasoning
Reducing I/O and computational overhead in long-context inference
Finding the optimal ratio between attention mechanisms for performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid architecture combining linear and softmax attention
Reduces inference cost to 1/10 of a 32B dense model
Optimizes the linear-to-softmax attention ratio for efficient long-context reasoning
Authors

Ling Team
Bin Han
Caizhi Tang
Chen Liang
Donghao Zhang
Fan Yuan
Feng Zhu
Jie Gao
Jingyu Hu
Longfei Li
Meng Li
Mingyang Zhang
Peijie Jiang
Peng Jiao
Qian Zhao
Qingyuan Yang
Wenbo Shen
Xinxing Yang
Yalin Zhang
Yankun Ren
Yao Zhao
Yibo Cao
Yixuan Sun
Yue Zhang
Yuchen Fang