Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

📅 2026-05-07

🤖 AI Summary
This work asks whether a Transformer can perform reinforcement learning purely in context: at deployment its parameters are frozen, and it must infer and execute a learning algorithm from trajectory data alone. The authors explicitly construct parameters for a linear self-attention block that implements policy-improvement methods, including semi-gradient SARSA and actor-critic. Beyond this existence result, they design a teacher-mimicking training procedure and analyze its gradient-flow dynamics, establishing what they describe as the first convergence guarantee in the in-context reinforcement learning literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Experiments on randomly generated tabular Markov decision processes support the theory: the trained Transformers recover the parameter structure of the explicit constructions and deliver strong in-context control performance on unseen MDPs.
📝 Abstract
We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.
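To give some intuition for why a linear self-attention block can carry out a semi-gradient SARSA update, here is a minimal sketch. It is not the paper's construction: the linear form Q(s, a) = w · φ(s, a), the batch-averaged update, the step size, and all function names are assumptions made for illustration. The sketch checks that the Q-value after one batch semi-gradient SARSA step equals an unnormalized (softmax-free) attention readout in which the keys are the context features, the query is the feature vector being evaluated, and the values are the TD errors.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sarsa_batch_step(w, transitions, alpha, gamma):
    """One batch semi-gradient SARSA step for linear Q(s, a) = w . phi(s, a).

    Each transition is (phi, reward, phi_next), where phi_next is the
    feature vector of the next state-action pair (hypothetical layout).
    """
    n = len(transitions)
    new_w = list(w)
    for phi, reward, phi_next in transitions:
        # TD error uses the old weights w; no gradient flows through the target.
        td_error = reward + gamma * dot(w, phi_next) - dot(w, phi)
        for k in range(len(w)):
            new_w[k] += (alpha / n) * td_error * phi[k]
    return new_w


def linear_attention_q(w, transitions, phi_query, alpha, gamma):
    """The same updated Q-value, written as linear attention over the context.

    score_i = phi_i . phi_query   (key-query inner product, no softmax)
    value_i = TD error of transition i
    """
    n = len(transitions)
    out = dot(w, phi_query)  # residual path: Q-value before the update
    for phi, reward, phi_next in transitions:
        value = reward + gamma * dot(w, phi_next) - dot(w, phi)
        score = dot(phi, phi_query)
        out += (alpha / n) * score * value
    return out


random.seed(0)
DIM = 4


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


# Synthetic context of 8 transitions; the data is random and purely illustrative.
context = [(rand_vec(), random.gauss(0.0, 1.0), rand_vec()) for _ in range(8)]
w0 = rand_vec()
phi_q = rand_vec()

q_explicit = dot(sarsa_batch_step(w0, context, alpha=0.1, gamma=0.9), phi_q)
q_attention = linear_attention_q(w0, context, phi_q, alpha=0.1, gamma=0.9)
gap = abs(q_explicit - q_attention)
print("gap between the two computations:", gap)
```

The two computations agree up to floating-point rounding because the update is linear in the features: taking the inner product of the new weights with the query feature turns each term of the weight update into a key-query score multiplied by a TD-error value. Acting greedily (or epsilon-greedily) on the resulting Q-values would supply the policy-improvement step; that part is omitted here.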
Problem

Research questions and friction points this paper is trying to address.

in-context reinforcement learning
transformers
policy improvement
parameter updates
trajectory data
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-context reinforcement learning
transformer architecture
policy improvement
convergence guarantee
gradient flow dynamics