Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work gives the first theoretical account of in-context reinforcement learning (ICRL) in Transformers that does not rely on the unrealistic linear attention assumption common in prior analyses. We prove that, under a specific parameter configuration, the layer-wise forward pass of a standard softmax-attention Transformer is equivalent to the iterative updates of a newly proposed weighted softmax temporal difference (TD) learning algorithm, which performs policy evaluation in kernel space and includes both linear and tabular TD as special cases. We show that, under a contraction condition, the policy evaluation error decays as the number of layers grows. Moreover, this parameter configuration is a global minimizer of a pretraining loss, which explains why it emerges in the numerical experiments.
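To make the TD side of this correspondence concrete, here is a minimal sketch of noise-free linear TD(0) on a known Markov reward process. It is not the paper's weighted softmax TD algorithm. The 3-state chain, the rewards, the step size, and the function name `expected_td0` are all hypothetical choices made for illustration. With one-hot features the linear update reduces to tabular TD, and each update shrinks the sup-norm error by at least the factor `1 - alpha + alpha * gamma`, which mirrors the idea that each additional update (or layer) reduces the policy evaluation error under a contraction.

```python
import numpy as np

def expected_td0(P, r, Phi, gamma, alpha, num_iters):
    """Noise-free linear TD(0) on a known Markov reward process.

    Returns the sup-norm value error after each update.
    """
    n = P.shape[0]
    v_true = np.linalg.solve(np.eye(n) - gamma * P, r)  # exact value function
    w = np.zeros(Phi.shape[1])
    errors = []
    for _ in range(num_iters):
        v = Phi @ w
        td_error = r + gamma * (P @ v) - v   # expected TD error for every state
        w = w + alpha * (Phi.T @ td_error)   # semi-gradient step (uniform state weights: an assumption)
        errors.append(np.max(np.abs(Phi @ w - v_true)))
    return np.array(errors)

# Hypothetical 3-state chain; each row of P sums to 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([1.0, 0.0, -1.0])
gamma, alpha = 0.9, 0.5
Phi = np.eye(3)  # one-hot features: linear TD reduces to tabular TD

errors = expected_td0(P, r, Phi, gamma, alpha, num_iters=200)
rho = 1 - alpha + alpha * gamma  # per-update contraction factor in the tabular case
print(errors[0], errors[-1])
```

In this tabular setting the error vector is multiplied by `(1 - alpha) * I + alpha * gamma * P` at every step, a nonnegative matrix whose rows sum to `rho`, which is why the sup-norm error contracts geometrically.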

📝 Abstract

In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and admits both linear TD and tabular TD as special cases. We also prove that, under a certain contraction condition and with the parameters identified above, the policy evaluation error decays as the number of layers grows. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.
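The difference between the two attention variants mentioned in the abstract can be sketched in a few lines. This is an illustrative single-head layer, not the paper's construction. The row-per-token layout, the absence of scaling, masking, and residual connections, and all names (`attention`, `W_q`, `W_k`, `W_v`) are assumptions. Softmax attention turns each row of scores into a probability distribution over context tokens, while linear attention uses the raw scores directly.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Z, W_q, W_k, W_v, use_softmax=True):
    """Single-head attention over the rows of Z; returns (output, weights)."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T
    # Linear attention replaces the softmax with the identity map.
    weights = softmax(scores, axis=-1) if use_softmax else scores
    return weights @ V, weights

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))                       # 5 context tokens, dimension 4
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

out_soft, w_soft = attention(Z, W_q, W_k, W_v, use_softmax=True)
out_lin, w_lin = attention(Z, W_q, W_k, W_v, use_softmax=False)
print(w_soft.sum(axis=1))  # each row of softmax weights is a distribution
```

Because the softmax weights are nonnegative and sum to one per row, each output token is a convex combination of value vectors, whereas the linear variant can weight values with arbitrary signs and magnitudes.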

Problem

Research questions and friction points this paper is trying to address.

in-context reinforcement learning

softmax attention

Transformer

theoretical analysis

temporal difference learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

softmax attention

in-context reinforcement learning

temporal difference learning