In-Context Reinforcement Learning From Suboptimal Historical Data

📅 2026-01-27
🏛️ International Conference on Machine Learning
📈 Citations: 2
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses a key limitation of conventional autoregressive approaches to offline in-context reinforcement learning: they merely imitate suboptimal behavior policies and struggle to recover optimal ones. To overcome this, the authors propose the Decision Importance Transformer (DIT) framework, which introduces the actor-critic paradigm into in-context reinforcement learning for the first time. DIT uses a Transformer architecture to model the advantage function over suboptimal trajectories and trains the policy model with an advantage-weighted maximum likelihood objective. This corrects the policy bias present in the historical data, allowing DIT to surpass the performance ceiling of standard imitation learning. Empirical results show that DIT significantly outperforms existing baselines on both multi-armed bandit and Markov decision process tasks, exhibiting particularly strong policy optimization when trained on suboptimal datasets.

📝 Abstract
Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
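The abstract's key training signal is the advantage-weighted maximum likelihood loss. The following is a minimal sketch of that objective in plain Python, assuming per-transition advantage estimates from the trained value transformer and per-transition log-probabilities from the policy transformer; the temperature `beta` and the clipping constant are illustrative choices, not details taken from the paper.

```python
import math

def advantage_weighted_nll(log_probs, advantages, beta=1.0, clip=20.0):
    """Weighted maximum likelihood loss (negated for minimization):
    loss = -(1/N) * sum_i w_i * log pi(a_i | s_i, context),
    with w_i = exp(A_i / beta).

    Transitions the critic scores above average (A_i > 0) get weight > 1,
    steering the policy away from merely imitating the behavior policy.
    The exponent is clipped for numerical stability (an illustrative detail).
    """
    assert len(log_probs) == len(advantages)
    weights = [math.exp(min(a / beta, clip)) for a in advantages]
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

With all advantages at zero every weight is 1 and the loss reduces to the standard imitation-learning negative log-likelihood; positive advantages amplify the likelihood pressure on the corresponding actions.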
Problem

Research questions and friction points this paper is trying to address.

in-context reinforcement learning
suboptimal historical data
offline reinforcement learning
autoregressive transformer
imitation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Reinforcement Learning
Suboptimal Offline Data
Decision Importance Transformer
Transformer-based Value Function
Weighted Maximum Likelihood Estimation