🤖 AI Summary
This work addresses two fundamental limitations in sequence modeling: the difficulty of simultaneously achieving parallelizable training and expressive state tracking, and the theoretical restriction of Transformers, under standard complexity conjectures, to the complexity class $\mathsf{TC}^0$, which leaves them unable to recognize all regular languages. To overcome these, the authors propose RWKV-7 "Goose", a novel architecture featuring: (1) a generalized delta rule with vector-valued gating and in-context learning rates, coupled with a relaxed value replacement rule; (2) recurrent state evolution with constant memory usage and constant inference time per token; and (3) a formal proof that the architecture can perform state tracking and recognize all regular languages, exceeding the conjectured $\mathsf{TC}^0$ bound on Transformers while retaining parallelizable training. Trained on a newly extended open-source 3.1-trillion-token multilingual corpus, the 2.9B-parameter model establishes a new state of the art on multilingual tasks at the 3B scale and matches current English-language SoTA despite requiring dramatically fewer training tokens. Four models (0.19B to 2.9B parameters), the dataset component listing, and full training and inference code are released under the Apache 2.0 License on Hugging Face and GitHub.
📝 Abstract
We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Despite these achievements, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers, which under standard complexity conjectures are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM, all under the Apache 2.0 License.
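To make the abstract's key mechanism concrete, the following is a minimal NumPy sketch of a delta-rule-style recurrent state update with vector-valued decay gating and an in-context learning rate, read out in constant time per token. This is an illustrative simplification, not the exact RWKV-7 formulation: RWKV-7's relaxed value replacement rule uses separate removal and replacement vectors, and the function and variable names here are assumptions for exposition.

```python
import numpy as np

def delta_rule_step(S, k, v, beta, w):
    """One recurrent step of a generalized delta rule (illustrative sketch).

    S    : (d_v, d_k) state matrix carried across tokens (constant memory)
    k    : (d_k,) key vector, assumed unit-normalized
    v    : (d_v,) value vector to associate with k
    beta : in-context learning rate in [0, 1] (scalar here for simplicity)
    w    : (d_k,) per-channel decay gate in (0, 1] (vector-valued gating)
    """
    S = S * w                                          # decay each key channel
    S = S - (S @ k)[:, None] * (beta * k)[None, :]     # erase the old value bound to k
    S = S + np.outer(v, beta * k)                      # write the new association
    return S

def read(S, q):
    """Constant-time readout of the value associated with query q."""
    return S @ q
```

With `beta = 1` and no decay (`w = 1`), writing a second value under the same key fully replaces the first, which is the classic delta-rule behavior the generalized formulation relaxes and extends.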