Transformer-like Inference from Optimal Control

📅 2026-05-15

🤖 AI Summary
This work formulates sequence prediction as an optimal control problem and derives, from optimal control principles, an explicit inference algorithm, the dual filter, whose layer structure mirrors that of a decoder-only Transformer. The framework covers two model classes: a nonlinear model of discrete-valued processes and a linear Gaussian model. In both, Transformer-like layer operations emerge as a consequence of the optimal control solution. Numerical experiments compare the optimal control with the attention weights of a trained Transformer and indicate that, when the embedding dimension is insufficient, the Transformer implicitly exploits non-Markovian structure.
📝 Abstract
Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem and, in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors that of a decoder-only transformer. Numerical experiments compare the optimal control with attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.
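For readers unfamiliar with the attention weights that the experiments compare against, here is a minimal sketch of standard single-head causal scaled dot-product attention in plain Python. It is not the paper's dual filter, whose equations are not given in this listing. The function names `causal_attention_weights` and `attend` and the toy inputs are assumptions made for illustration.

```python
import math

def causal_attention_weights(queries, keys):
    """Return a T x T matrix; row t is a softmax over positions 0..t."""
    d = len(queries[0])
    T = len(queries)
    weights = []
    for t, q in enumerate(queries):
        # Scaled dot-product scores against the current and past positions only.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        # Future positions receive weight zero (the causal mask).
        weights.append([e / z for e in exps] + [0.0] * (T - t - 1))
    return weights

def attend(weights, values):
    """Weighted average of value vectors, one output per position."""
    dim = len(values[0])
    return [[sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
            for row in weights]

# Toy example (hypothetical embeddings, reused as queries, keys, and values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(X, X)
Y = attend(W, X)
```

Each row of `W` is a probability distribution over the past, which is the quantity a trained transformer exposes and that the abstract says is compared with the optimal control.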
Problem

Research questions and friction points this paper is trying to address.

optimal control
transformer
inference
prediction
conditional probability
Innovation

Methods, ideas, or system contributions that make the work stand out.

optimal control
transformer architecture
dual filter
non-Markovian dynamics
sequence modeling