RWKV-X: A Linear Complexity Hybrid Language Model

📅 2025-04-30
🤖 AI Summary
This work addresses the challenge that language models struggle to simultaneously achieve efficient short-range modeling and effective long-range dependency capture, while suffering from high training and inference complexity. We propose RWKV-X, a linear-complexity hybrid architecture. Its core innovation lies in coupling RWKV's recurrent state modeling, which ensures O(1) per-token decoding, with a custom sparse attention mechanism that enhances long-range contextual modeling, thereby achieving, for the first time, O(N) training and O(1) per-step decoding complexity. Through continual pretraining on 64K-length sequences and memory–computation co-optimization, RWKV-X maintains stability over million-token decoding and achieves near-100% accuracy on the 64K-context passkey retrieval task. Experiments demonstrate that RWKV-X significantly outperforms RWKV-7 on long-context benchmarks while preserving strong short-context performance. Code and checkpoints are publicly released.

📝 Abstract
In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: https://github.com/howard-hou/RWKV-X.
Problem

Research questions and friction points this paper is trying to address.

Develops RWKV-X for efficient short and long-range context modeling
Achieves linear-time training and constant-time inference complexity
Enables scalable decoding up to 1 million tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid architecture combining RWKV and sparse attention
Linear-time complexity in training and constant-time complexity in decoding
Scalable for sequences up to 1 million tokens
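
As a rough illustration of why a hybrid of this kind can keep per-token decoding cost constant, the toy sketch below pairs a fixed-size recurrent state with an attention cache capped at a fixed budget. It is not the RWKV-X implementation: the function names, the `budget` parameter, the decay constant, and the "evict the lowest-scoring entry" rule are all hypothetical choices made only for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recurrent_step(state, x, decay=0.9):
    # Fixed-size state update: the work depends on len(state), not on position.
    return [decay * s + xi for s, xi in zip(state, x)]

def sparse_attention_step(query, cache, key, value, budget):
    # Hypothetical bounded cache: add the new pair, then evict the entry that
    # scores lowest against the current query once the budget is exceeded.
    cache.append((key, value))
    if len(cache) > budget:
        worst = min(range(len(cache)), key=lambda i: dot(query, cache[i][0]))
        cache.pop(worst)
    # Softmax attention over at most `budget` cached entries.
    scores = [dot(query, k) for k, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(value)
    return [sum(w * v[j] for w, (_, v) in zip(weights, cache)) / total
            for j in range(dim)]

def decode(tokens, dim=4, budget=8):
    # Each step touches a state of size `dim` and at most `budget` cache
    # entries, so per-step cost does not grow with the number of tokens seen.
    state = [0.0] * dim
    cache = []
    out = [0.0] * dim
    max_cache = 0
    for x in tokens:
        state = recurrent_step(state, x)
        out = sparse_attention_step(state, cache, x, x, budget)
        max_cache = max(max_cache, len(cache))
    return state, out, max_cache
```

Calling `decode` on a long list of `dim`-length vectors returns the final state, the last attention output, and the largest cache size observed, which stays at or below `budget` however long the input is. Summed over N tokens, that bounded per-step work is what gives linear total cost in this toy setting.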
Haowen Hou
Assistant Professor, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
RWKV, LLM, VLM, Information Retrieval
Zhiyi Huang
College of Information Science and Engineering, Hohai University, Nanjing, China
Kaifeng Tan
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Rongchang Lu
School of Ecological and Environmental Engineering, Qinghai University, Xining, China
Fei Richard Yu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China