RWKV-7"Goose"with Expressive Dynamic State Evolution

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two fundamental limitations in sequence modeling: the difficulty of simultaneously achieving parallel training and expressive state tracking, and the restriction of Transformers (under standard complexity conjectures) to the complexity class $\mathsf{TC}^0$, which leaves them unable to recognize all regular languages. To overcome these, we propose RWKV-7 "Goose", a novel architecture featuring: (1) a generalized delta rule with vector-valued gating and in-context learning rates, coupled with a relaxed value replacement rule; (2) recurrent state evolution with constant memory usage and constant inference time per token; and (3) a proof that the architecture can perform state tracking and recognize all regular languages while remaining parallelizable to train, exceeding the $\mathsf{TC}^0$ bound conjectured for Transformers. Trained on a newly extended open-source 3.1-trillion-token multilingual corpus, the models achieve a new state of the art on multilingual tasks at the 3B-parameter scale and match current English-language SoTA despite being trained on dramatically fewer tokens. We release four models (0.19B–2.9B parameters), the dataset component listing, and full training and inference code under Apache 2.0 on Hugging Face and GitHub.
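
To make the state-update idea concrete, here is a minimal numpy sketch of a generalized delta-rule step with a vector-valued decay gate and an in-context learning rate, in the spirit of what the summary describes. The variable names and the exact composition of decay, removal, and write are illustrative assumptions, not the paper's verbatim equations.

```python
import numpy as np

def delta_rule_step(S, k, v, w, a):
    """One recurrent step of a generalized delta-rule update (illustrative sketch).

    S : (d_v, d_k) matrix-valued state carried from the previous token
    k : (d_k,) key for this token, assumed unit-norm
    v : (d_v,) value for this token
    w : (d_k,) vector-valued decay/gate in (0, 1], one entry per key channel
    a : (d_k,) in-context learning rate in [0, 1], one entry per key channel
    """
    S = S * w[None, :]                 # per-channel decay: S <- S diag(w)
    old_v = S @ k                      # value currently stored under key k
    S = S - np.outer(old_v, a * k)     # delta rule: remove it, scaled per channel by a
    S = S + np.outer(v, k)             # write the new value under key k
    return S

# With w = 1 and a = 1 this reduces to the classic delta rule's exact
# key-value replacement: S <- S (I - k k^T) + v k^T.
d_k, d_v = 8, 8
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))
k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
S = delta_rule_step(S, k, v=rng.standard_normal(d_v),
                    w=np.full(d_k, 0.99), a=np.full(d_k, 0.5))
```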

📝 Abstract
We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
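
The constant-memory, constant-time claim follows from the recurrent form of the model: generation carries only a fixed-size state between tokens, so neither memory nor per-token work grows with sequence length, unlike a Transformer's KV cache. A self-contained sketch under the same illustrative assumptions as above:

```python
import numpy as np

d_k, d_v, seq_len = 64, 64, 10_000
rng = np.random.default_rng(0)

S = np.zeros((d_v, d_k))   # the only state carried across tokens: fixed (d_v, d_k) size

for t in range(seq_len):
    # In a real model k, v, w, a, q come from learned projections of token t;
    # random vectors stand in for them here.
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
    v = rng.standard_normal(d_v)
    w = np.full(d_k, 0.99)                              # illustrative constant decay gate
    a = np.full(d_k, 0.5)                               # illustrative constant learning rate

    S = S * w[None, :]                                  # per-channel decay
    S = S - np.outer(S @ k, a * k) + np.outer(v, k)     # delta-rule remove + write, O(d_k * d_v)
    y = S @ rng.standard_normal(d_k)                    # per-token readout, also O(d_k * d_v)

print(S.shape)   # (64, 64) regardless of seq_len: memory and per-token time stay constant
```
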
Problem

Research questions and friction points this paper is trying to address.

Parallel-trainable sequence models have struggled to also perform expressive state tracking.
Transformers are limited to $\mathsf{TC}^0$ under standard complexity conjectures, so they cannot recognize all regular languages.
Attention-based models require memory and per-token inference cost that grow with sequence length.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constant memory usage and constant inference time per token
Generalized delta rule with vector-valued gating and in-context learning rates, plus a relaxed value replacement rule
State tracking and recognition of all regular languages while retaining parallelizable training (a toy sketch follows this list)
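
The regular-language claim can be made concrete with a toy example: a recurrence whose state-transition matrix depends on the current token can simulate a finite automaton exactly, which is precisely the kind of state tracking believed to be out of reach for $\mathsf{TC}^0$-bounded Transformers. The DFA below (even number of a's) and its matrices are invented for illustration; this is not the paper's construction.

```python
import numpy as np

# Toy regular language: strings over {a, b} with an even number of a's.
# The DFA state is a one-hot vector and each input symbol selects a transition
# matrix, so recognition is a sequence of input-dependent linear state updates --
# the same shape of computation as a recurrence with dynamic state evolution.
TRANSITIONS = {
    "a": np.array([[0, 1],
                   [1, 0]]),   # 'a' swaps the even/odd parity states
    "b": np.eye(2, dtype=int), # 'b' leaves the state unchanged
}
START = np.array([1, 0])       # state 0: even number of a's seen so far
ACCEPT = np.array([1, 0])      # accept iff we end in state 0

def accepts(s: str) -> bool:
    state = START
    for ch in s:
        state = TRANSITIONS[ch] @ state   # one constant-size update per token
    return bool(ACCEPT @ state)

print(accepts("abba"))   # True: two a's
print(accepts("ab"))     # False: one a
```
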
Bo Peng
RWKV Project (under Linux Foundation AI & Data), EleutherAI
Ruichong Zhang
Tsinghua University
Daniel Goldstein
EleutherAI, Recursal AI
Eric Alcaide
EleutherAI, Dalle Molle Institute for Artificial Intelligence USI-SUPSI
Haowen Hou
Assistant Professor, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
RWKV, LLM, VLM, Information Retrieval
Janna Lu
Recursal AI, George Mason University
William Merrill
Ai2 / TTIC
language models, formal languages, computational linguistics, deep learning
Guangyu Song
EleutherAI, Tano Labs
Kaifeng Tan
Shenzhen University
Saiteja Utpala
EleutherAI
Nathan Wilce
EleutherAI, Recursal AI
J. S. Wind
University of Oslo
Tianyi Wu
Beijing Normal University
Daniel Wuttke
EleutherAI
Christian Zhou-Zheng
EleutherAI