Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work gives the first theoretical account of in-context reinforcement learning (ICRL) in Transformers that does not rely on the unrealistic linear attention assumption common in prior analyses. We prove that, under a specific parameter configuration, the layer-wise forward pass of a standard softmax-attention Transformer is equivalent to the iterative updates of a newly proposed weighted softmax temporal difference (TD) learning algorithm, which performs policy evaluation in kernel space and includes both linear and tabular TD as special cases. Under a contraction condition, the policy evaluation error decays as the number of layers grows. This parameter configuration is also a global minimizer of a pretraining loss, which explains why it emerges in the paper's numerical experiments.
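To make the "special cases" relationship concrete, here is a minimal sketch of a standard fact about the two classical algorithms named above: linear TD(0) with one-hot state features performs the same updates as tabular TD(0). This is not the paper's weighted softmax TD algorithm, which the summary does not specify. The function names, the toy transitions, and the step size and discount values are assumptions chosen for illustration.

```python
import numpy as np

def tabular_td0(transitions, n_states, alpha=0.1, gamma=0.9):
    """Tabular TD(0): one value estimate per state."""
    v = np.zeros(n_states)
    for s, r, s_next in transitions:
        # Move v[s] toward the bootstrapped target r + gamma * v[s_next].
        v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v

def linear_td0(transitions, features, alpha=0.1, gamma=0.9):
    """Linear TD(0): value estimate is features[s] @ w."""
    w = np.zeros(features.shape[1])
    for s, r, s_next in transitions:
        x, x_next = features[s], features[s_next]
        td_error = r + gamma * (x_next @ w) - (x @ w)
        w += alpha * td_error * x
    return w

# Hypothetical toy data: (state, reward, next_state) triples.
transitions = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0), (0, 1.0, 1)]

v_tab = tabular_td0(transitions, n_states=3)
# One-hot features (identity matrix) make the weight vector a value table.
w_lin = linear_td0(transitions, features=np.eye(3))
```

With one-hot features the two functions produce the same numbers, so `v_tab` and `w_lin` can be compared directly with `np.allclose`.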
📝 Abstract
In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and admits both linear TD and tabular TD as special cases. We also prove that, under a certain contraction condition and with the parameters identified above, the policy evaluation error decays as the number of layers grows. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.
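The distinction the abstract draws between softmax attention and linear attention can be sketched in a few lines. The example below is a simplified single-head attention with no learned projections and no 1/sqrt(d) scaling; those simplifications, the function names, and the random inputs are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, use_softmax=True):
    """Single-head attention; linear attention swaps softmax for the identity."""
    scores = Q @ K.T
    weights = softmax(scores, axis=-1) if use_softmax else scores
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))  # 4 queries of dimension 3
K = rng.normal(size=(5, 3))  # 5 keys of dimension 3
V = rng.normal(size=(5, 2))  # 5 values of dimension 2

soft_out = attention(Q, K, V, use_softmax=True)
lin_out = attention(Q, K, V, use_softmax=False)
soft_weights = softmax(Q @ K.T)
```

In the softmax case each row of `soft_weights` is nonnegative and sums to one, so every output row is a convex combination of the value vectors. In the linear case the raw scores are used directly, so the output is just `Q @ K.T @ V` and is not constrained in that way.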
Problem

Research questions and friction points this paper is trying to address.

in-context reinforcement learning
softmax attention
Transformer
theoretical analysis
temporal difference learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

softmax attention
in-context reinforcement learning
temporal difference learning
Transformer
kernel policy evaluation