🤖 AI Summary
This work formulates next-token prediction as an optimal control problem and derives from it an explicit inference algorithm, the dual filter, whose layer structure mirrors that of a decoder-only Transformer. The framework covers two model classes: a nonlinear model of discrete-valued processes and a linear Gaussian model used as a tractable baseline. In both cases, Transformer-like layer operations, including attention-like weighting of past observations, arise as a consequence of the optimal control solution. Numerical experiments compare the optimal control with the attention weights of a trained Transformer and indicate that, when the embedding dimension is insufficient, the Transformer implicitly exploits non-Markovian structure.
📝 Abstract
Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem and, in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer. Numerical experiments provide a comparison of the optimal control to attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.
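To make the prediction problem in the first sentence concrete, the following is a minimal sketch of how a single causal attention head can map a sequence of past tokens to a next-token distribution. It is an illustration of a standard attention layer, not the dual filter derived in the paper. The function names, the toy vocabulary and embedding sizes, and the random weights are all assumptions introduced for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


def next_token_probs(tokens, embed, w_q, w_k, w_v, w_out):
    """Single-head attention read out at the last position (hypothetical helper).

    Returns the attention weights over the observed positions and the
    resulting distribution over the next token.
    """
    xs = [embed[t] for t in tokens]            # embed each past observation
    query = matvec(w_q, xs[-1])                # query from the latest token
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    attn = softmax(scores)                     # one weight per observed position
    dim = len(values[0])
    context = [sum(a * v[i] for a, v in zip(attn, values)) for i in range(dim)]
    logits = matvec(w_out, context)            # one logit per vocabulary item
    return attn, softmax(logits)


def random_matrix(rows, cols, rng):
    """Matrix with standard normal entries, used only as placeholder weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VOCAB, DIM = 5, 4                              # assumed toy sizes
embed = random_matrix(VOCAB, DIM, rng)
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng) for _ in range(3))
w_out = random_matrix(VOCAB, DIM, rng)

attn, probs = next_token_probs([0, 3, 1, 4], embed, w_q, w_k, w_v, w_out)
```

Here `attn` holds one weight per observed token and `probs` is a distribution over the toy vocabulary. The weights in `attn` are the kind of quantity that the numerical experiments compare against the optimal control.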