Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high I/O and computational overhead of long-context reasoning, this paper introduces the Ring-linear model family. The models adopt a hybrid architecture that combines linear and softmax attention mechanisms, with the training and inference engine operators kept closely aligned. The authors systematically explore the ratio between the two attention mechanisms to identify the currently optimal structure, and leverage Linghe, their self-developed high-performance FP8 operator library, which improves overall training efficiency by 50% and hardware utilization. Compared with a 32B dense baseline, inference cost drops to roughly 1/10; relative to the original Ring series, cost is cut by over 50%, while the models maintain state-of-the-art performance across multiple complex reasoning benchmarks. The design balances efficiency, stability, and scalability, offering a practical path for deploying large language models with extended context windows.

📝 Abstract
In this technical report, we present the Ring-linear model series, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B total parameters with 957M activated parameters, while Ring-flash-linear-2.0 contains 104B total parameters with 6.1B activated parameters. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32-billion-parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is reduced by over 50%. Furthermore, through systematic exploration of the ratio between the different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging Linghe, our self-developed high-performance FP8 operator library, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
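As a rough illustration of the hybrid design described in the abstract, the sketch below interleaves generic linear-attention and softmax-attention layers at a configurable ratio. This is a minimal, assumption-laden PyTorch mock-up, not the authors' implementation: the class names, the kernel feature map used for linearization, and the `softmax_every` ratio are illustrative choices, and residual connections, MLP blocks, causal masking, and the FP8 kernels are omitted for brevity.

```python
import torch
import torch.nn as nn


class SoftmaxAttention(nn.Module):
    """Standard multi-head softmax attention, O(n^2) in sequence length."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class LinearAttention(nn.Module):
    """Generic kernel-based linear attention, O(n) in sequence length.

    Stand-in only: the actual Ring-linear operator is not reproduced here.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq, head_dim).
        q = q.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # Softmax feature maps, then reassociate (q (k^T v)) so cost is linear in n.
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        out = torch.einsum("bhnd,bhde->bhne", q, kv)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.out(out)


def build_hybrid_stack(dim: int = 1024, num_heads: int = 8,
                       depth: int = 12, softmax_every: int = 4) -> nn.Sequential:
    """Interleave linear and softmax attention at a fixed ratio.

    With softmax_every=4, three of every four attention layers are linear,
    the kind of mixture-ratio trade-off the report explores.
    """
    layers = []
    for i in range(depth):
        if (i + 1) % softmax_every == 0:
            layers.append(SoftmaxAttention(dim, num_heads))
        else:
            layers.append(LinearAttention(dim, num_heads))
    return nn.Sequential(*layers)
```

Under these assumptions, most layers scale linearly with context length while the periodic softmax layers retain full pairwise attention, which is the intuition behind the reported reduction in long-context I/O and compute.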
Problem

Research questions and friction points this paper is trying to address.

Designing an efficient hybrid attention architecture for long-context reasoning
Reducing I/O and computational overhead in long-context inference
Finding the optimal ratio between attention mechanisms for performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid architecture combining linear and softmax attention
Reduces inference cost to 1/10 of a 32B dense model
Optimizes the linear-to-softmax attention ratio for efficient long-context reasoning
Authors

Ling Team
Bin Han
Caizhi Tang
Chen Liang
Donghao Zhang
Fan Yuan
Feng Zhu
Jie Gao
Jingyu Hu
Longfei Li
Meng Li
Mingyang Zhang
Peijie Jiang
Peng Jiao
Qian Zhao
Qingyuan Yang
Wenbo Shen
Xinxing Yang
Yalin Zhang
Yankun Ren
Yao Zhao
Yibo Cao
Yixuan Sun
Yue Zhang
Yuchen Fang