Cubit: Token Mixer with Kernel Ridge Regression

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work revisits token mixing in Transformers, which relies on attention. The authors interpret attention as Nadaraya-Watson kernel regression and propose Cubit, a token-mixing architecture built on kernel ridge regression (KRR). Cubit incorporates the closed-form KRR solution into the attention computation, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve training stability, the authors introduce Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. The experiments suggest that Cubit may model long sequences better than the standard Transformer, with a performance gain that appears to grow as the training sequence length increases.
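
For reference, the two estimators named in the summary can be written in their textbook forms. The notation here is assumed for illustration and is not necessarily the paper's: $\kappa$ is the kernel, $x_i$ and $y_i$ play the role of keys and values, $\lambda$ is the ridge strength, and $Y$ stacks the values row by row.

```latex
% Nadaraya-Watson estimator: a similarity-weighted average of the values.
\hat{f}_{\mathrm{NW}}(x) = \frac{\sum_{i=1}^{n} \kappa(x, x_i)\, y_i}{\sum_{j=1}^{n} \kappa(x, x_j)}

% With \kappa(q, k) = \exp(q^\top k / \sqrt{d}), this is one row of softmax attention.

% Kernel ridge regression: closed-form solution with kernel matrix K_{ij} = \kappa(x_i, x_j).
\hat{f}_{\mathrm{KRR}}(x) = \mathbf{k}(x)^\top (K + \lambda I)^{-1} Y,
\qquad \mathbf{k}(x) = [\kappa(x, x_1), \dots, \kappa(x, x_n)]^\top
```

The difference lies in the normalization: Nadaraya-Watson divides by a per-query scalar sum of similarities, whereas KRR multiplies by the inverse of the regularized kernel matrix computed over all keys.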

📝 Abstract

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
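
The contrast drawn in the abstract can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: the exponential dot-product kernel, the `ridge` parameter, and the function names `exp_kernel`, `nadaraya_watson_mix`, and `krr_mix` are assumptions made for this example. It omits causal masking, multi-head structure, numerical stabilization of the exponential, and the Limited-Range Rescale step, whose details are not given here.

```python
import numpy as np


def exp_kernel(a, b):
    """Exponential dot-product kernel exp(<a, b> / sqrt(d)), the similarity behind softmax."""
    return np.exp(a @ b.T / np.sqrt(a.shape[-1]))


def nadaraya_watson_mix(q, k, v):
    """Softmax attention as Nadaraya-Watson regression: row-normalized kernel weights times values."""
    weights = exp_kernel(q, k)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def krr_mix(q, k, v, ridge=1e-2):
    """Kernel ridge regression mixer: K(q, k) @ (K(k, k) + ridge * I)^-1 @ v."""
    gram = exp_kernel(k, k)  # kernel matrix over the keys
    # Solve the regularized linear system instead of forming the inverse explicitly.
    alpha = np.linalg.solve(gram + ridge * np.eye(k.shape[0]), v)
    return exp_kernel(q, k) @ alpha


rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))  # 6 query tokens, head dimension 4 (hypothetical sizes)
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 3))
print(nadaraya_watson_mix(q, k, v).shape, krr_mix(q, k, v).shape)
```

When the queries equal the keys, the kernel matrix is well conditioned, and `ridge` is close to zero, the KRR mixer reproduces the values almost exactly, which the Nadaraya-Watson average generally does not. The linear solve costs cubic time in the number of keys in this naive form.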

Problem

Research questions and friction points this paper is trying to address.

Transformer

token mixing

attention mechanism

long-sequence modeling

regression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel Ridge Regression

Token Mixer

Transformer Architecture