🤖 AI Summary
Standard Transformers struggle to generalize to regular languages and NC¹-complete problems due to their lack of explicit state-tracking mechanisms. This work proposes Rational Transductors, which for the first time integrate a matrix recurrence derived from weighted finite automata into the Transformer architecture. By injecting rational state information directly into the attention mechanism through deep rational injection, the model strictly extends its expressive power to encompass all regular languages and NC¹-complete problems while preserving efficient parallel computation. Empirical evaluations on tasks such as Parity and modular counting demonstrate robust length generalization, effectively overcoming both the sequential bottleneck of conventional RNNs and the representational limitations of standard Transformers.
📄 Abstract
Standard Transformers excel at semantic modeling but struggle with rigid sequential logic and state tracking. Theoretical work establishes that self-attention is limited to $\AC^0$ (under hard attention) or $\TC^0$ (under soft attention), complexity classes that often fail to support robust length generalization on sequential problems without intermediate chain-of-thought. In this work, we introduce \emph{Rational Transductors}, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence derived from Weighted Finite Automata (WFA). By injecting rational state information into the attention mechanism via a \emph{Deep Rational Injection} scheme, our framework strictly generalizes the expressive power of Transformers to capture all Regular Languages, $\NC^1$-complete problems (such as Boolean Formula Evaluation), and fundamental separations like Parity and Modular Counting, while preserving $O(L + \log T)$ parallel time complexity. We ground the architecture in a rigorous learning theory: we prove that \emph{Random Rational Features} act as a universal basis for sequential dependencies, justifying our initialization strategy, while establishing that the \emph{Differentiable Rational Feature} regime is necessary to close the representational compactness gap. Theoretical analysis and empirical results demonstrate that Rational Transductors solve the "Regular Gap," enabling robust length generalization on algorithmic tasks where standard Transformers fail, without the sequential computational bottlenecks of traditional RNNs.
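To make the WFA-style matrix recurrence concrete, here is a minimal sketch on the Parity separation mentioned above. All names and shapes are illustrative assumptions, not the paper's actual API: each input symbol selects a transition matrix, and per-position states are prefix matrix products. Because matrix multiplication is associative, these products admit a log-depth parallel scan; the loop below computes them sequentially only for clarity.

```python
import numpy as np

# Hypothetical sketch of a WFA as a matrix recurrence, on the 2-state
# parity automaton: reading a 0 keeps the state, reading a 1 swaps it.
M = {0: np.eye(2), 1: np.array([[0., 1.], [1., 0.]])}

def wfa_states(bits):
    """Per-position rational states: alpha @ M[b_1] @ ... @ M[b_t].

    Computed with a sequential loop for readability; associativity of
    matrix multiplication is what permits an O(log T)-depth scan instead.
    """
    alpha = np.array([1., 0.])       # initial state: even parity
    states = []
    for b in bits:
        alpha = alpha @ M[b]         # one step of the matrix recurrence
        states.append(alpha.copy())  # state that would be injected into attention
    return np.stack(states)

def parity(bits):
    # Read off the "odd" component of the final state vector.
    return int(wfa_states(bits)[-1] @ np.array([0., 1.]))
```

In the dual-stream architecture described above, such per-position state vectors (not just the final one) would feed the attention mechanism; here they are simply returned.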