Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of theoretical understanding of how chain-of-thought (CoT) reasoning enhances in-context reinforcement learning (ICRL). Within a policy evaluation framework, the authors model the CoT generation process using a linear Transformer and formally prove, for the first time, that under a specific parameter configuration CoT generation is equivalent to repeatedly applying temporal difference (TD) learning updates. They also show that this configuration is a global minimizer of the pretraining loss, which explains why such parameters emerge in practice. A finite-sample analysis shows that the policy evaluation error decays geometrically with increasing CoT length and then saturates at a statistical floor determined by the context length. These results elucidate both the mechanistic role and the convergence properties of CoT in ICRL.

📝 Abstract

In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding of how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with a linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding of the empirical emergence of those parameters.
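
The two quantitative claims above (geometric decay in the number of TD updates, then a floor set by the context size) can be illustrated numerically. The sketch below is not the paper's linear-Transformer construction. It is a plain batch TD(0) iteration over a fixed context of transitions, with one-hot (tabular) features. The 3-state reward process, the names `sample_context`, `td_sweep`, and `error_curve`, and the choices of discount and step size are all assumptions made for illustration. Each sweep plays the role of one CoT step, and `n` plays the role of the context length.

```python
import random

# Hypothetical 3-state Markov reward process, chosen only for illustration.
GAMMA = 0.5
REWARDS = [1.0, 2.0, 3.0]
P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]


def true_values(iters=200):
    """Ground-truth value function, found by iterating the Bellman equation."""
    v = [0.0] * 3
    for _ in range(iters):
        v = [REWARDS[s] + GAMMA * sum(P[s][t] * v[t] for t in range(3))
             for s in range(3)]
    return v


def sample_context(n, rng):
    """Build a context of n transitions (s, r, s'); start states cycle 0, 1, 2."""
    context = []
    for i in range(n):
        s = i % 3
        s_next = rng.choices(range(3), weights=P[s])[0]
        context.append((s, REWARDS[s], s_next))
    return context


def td_sweep(w, context, alpha=1.0):
    """One batch TD(0) update of the value estimates w over the whole context."""
    delta = [0.0] * 3
    for s, r, s_next in context:
        delta[s] += r + GAMMA * w[s_next] - w[s]  # accumulate TD errors per state
    return [w[s] + alpha * delta[s] / len(context) for s in range(3)]


def error_curve(context, steps):
    """Max-norm error to the true values after 0, 1, ..., steps sweeps."""
    v = true_values()
    w = [0.0] * 3
    errors = [max(abs(w[s] - v[s]) for s in range(3))]
    for _ in range(steps):
        w = td_sweep(w, context)
        errors.append(max(abs(w[s] - v[s]) for s in range(3)))
    return errors


rng = random.Random(0)
for n in (30, 3000):
    errors = error_curve(sample_context(n, rng), steps=150)
    print(n, [round(errors[k], 3) for k in (0, 10, 50, 150)])
```

In this toy setting the iterates converge to the value function of the empirical model built from the context, so additional sweeps stop helping once the remaining error is dominated by sampling noise. A larger context typically lowers that floor, although for any single random draw this is not guaranteed.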

Problem

Research questions and friction points this paper is trying to address.

In-Context Reinforcement Learning

Chain-of-Thought

Policy Evaluation

Transformer

Convergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Reinforcement Learning

Chain-of-Thought

Temporal Difference Learning