Towards Provable Emergence of In-Context Reinforcement Learning

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates why reinforcement learning (RL) pretraining enables in-context RL (ICRL): the ability of agents to adapt to new tasks at inference time without parameter updates, using only contextual inputs. We propose, and provide initial support for, the hypothesis that network parameters exhibiting ICRL capability are minimizers of the pretraining loss. Methodologically, we pretrain a Transformer for value-function estimation (policy evaluation) with a temporal-difference (TD) objective while conditioning it on an explicit context. Theoretically, we prove that under mild assumptions, one global minimizer of the pretraining loss supports context-conditional TD learning, enabling task adaptation without any gradient updates. This gives an optimization-geometric explanation of ICRL, formally linking the pretraining objective to emergent in-context generalization. The analysis suggests that ICRL is not an accidental byproduct but a structural consequence of the geometry of the pretraining loss, with the context serving as a functional proxy for gradient-based adaptation.
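To make the claim concrete, the following is a minimal sketch of what "in-context TD" computes in the simplest linear setting. This is an illustration, not the paper's construction: the function name, features, discount, and step size below are hypothetical choices, and the paper's result is that a suitably pretrained Transformer's forward pass can emulate this kind of TD(0) computation over the context while its own parameters stay frozen.

```python
# Illustrative sketch only: plain TD(0) run over a context of transitions.
# The paper's claim is that a pretrained Transformer can emulate this
# computation inside its forward pass, with its parameters frozen.
import numpy as np

def in_context_td_value(context, query_features, gamma=0.9, alpha=0.1):
    """context: iterable of (phi_s, reward, phi_s_next) tuples from a new task."""
    w = np.zeros_like(query_features, dtype=float)  # value weights held "in context"
    for phi_s, r, phi_next in context:
        td_error = r + gamma * phi_next @ w - phi_s @ w
        w = w + alpha * td_error * phi_s            # semi-gradient TD(0) update
    return float(query_features @ w)                # value estimate for the query state

# Toy usage: a 3-state cyclic chain with reward 1 and gamma = 0.9,
# so the true value of every state is 1 / (1 - 0.9) = 10.
rng = np.random.default_rng(0)
phi = lambda s: np.eye(3)[s]                        # tabular (one-hot) features
context = [(phi(s), 1.0, phi((s + 1) % 3)) for s in rng.integers(0, 3, 300)]
print(in_context_td_value(context, phi(0)))         # improves toward 10 as the context grows
```

Note that only the in-context weight vector changes as more transitions arrive; this mirrors the paper's point that adaptation is carried by the context rather than by gradient updates to the network.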

📝 Abstract
Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to that task. Recently, it has been observed that, after pretraining on some task distribution, certain RL agents can solve a wide range of new, out-of-distribution tasks without any parameter updates. When evaluated on a new task, instead of making parameter updates, the pretrained agent conditions its policy on an additional input called the context, e.g., the agent's interaction history in the new task. The agent's performance improves as the information in the context grows, even though the agent's parameters remain fixed. This phenomenon is typically called in-context RL (ICRL). It is the pretrained parameters of the agent network that enable this remarkable phenomenon, yet many ICRL works obtain them with standard RL pretraining algorithms. This raises the central question this paper aims to address: why can an RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study: we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference (TD) learning.
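For reference, the two notions the abstract relies on, policy evaluation and TD(0), are standard; the definitions below use generic notation and are not taken from the paper.

```latex
% Standard definitions; generic notation, not the paper's.
% Bellman evaluation equation for a fixed policy \pi:
v_\pi(s) = \mathbb{E}_\pi\!\left[ r_{t+1} + \gamma\, v_\pi(s_{t+1}) \mid s_t = s \right]

% Semi-gradient TD(0) with linear features \phi, after observing (s, r, s'):
w \leftarrow w + \alpha \left( r + \gamma\, \phi(s')^\top w - \phi(s)^\top w \right) \phi(s)
```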
Problem

Research questions and friction points this paper is trying to address.

Understanding why RL pretraining algorithms produce parameters enabling in-context learning
Proving that pretraining loss minimizers can enable in-context reinforcement learning
Demonstrating that Transformers pretrained for policy evaluation enable in-context TD learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer pretrained for policy evaluation
A global minimizer of the pretraining loss enables in-context learning
In-context temporal difference learning proven to emerge at this minimizer (see the schematic statement below)
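One schematic way to read the main result is sketched below; the paper's exact assumptions, architecture, and formulation may differ from this rendering.

```latex
% Schematic form of the claimed result; the paper's precise statement may differ.
\exists\, \theta^\star \in \arg\min_\theta \mathcal{L}_{\mathrm{pretrain}}(\theta)
\ \text{ such that, for a new task with context } C_n = (s_i, r_i, s'_i)_{i=1}^{n},
\quad \mathrm{TF}_{\theta^\star}(C_n, s) \approx \phi(s)^\top w_n,
% where w_n is the weight vector obtained from n TD(0) updates on C_n:
% adaptation happens through the context, not through gradient steps on theta.
```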