🤖 AI Summary
Existing in-context reinforcement learning (ICRL) methods use Transformers to imitate stepwise RL algorithms, avoiding parameter updates but inheriting the suboptimality of the imitated policies. Method: The paper proposes Distillation for In-Context Planning (DICP), a model-based framework that distills both implicit dynamics modeling and in-context planning capability into a single Transformer, enabling simultaneous environment modeling and policy improvement without parameter updates. Contribution/Results: The approach breaks the conventional reliance on imitating iterative update rules, supports both discrete and continuous control domains, and achieves state-of-the-art performance on benchmarks including Darkroom variants and Meta-World, while requiring significantly fewer environment interactions than model-free counterparts and existing meta-RL baselines.
📝 Abstract
Recent studies have shown that Transformers can perform in-context reinforcement learning (RL) by imitating existing RL algorithms, enabling sample-efficient adaptation to unseen tasks without parameter updates. However, these models also inherit the suboptimal behaviors of the RL algorithms they imitate. This issue primarily arises from the gradual update rules employed by those algorithms. Model-based planning offers a promising solution to this limitation by allowing the models to simulate potential outcomes before taking action, providing an additional mechanism for deviating from the suboptimal behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework in which Transformers simultaneously learn environment dynamics and improve their policy in context. We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods.
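To make the planning mechanism described above concrete, here is a minimal, purely illustrative sketch of model-based action selection: candidate actions are scored by short simulated rollouts under a learned dynamics model before one is executed. All names are hypothetical, and the toy `dynamics_model` stands in for the Transformer's in-context dynamics predictions; this is not the paper's implementation.

```python
GOAL = 3  # toy 1-D navigation target


def dynamics_model(state, action):
    # Stand-in for the learned in-context dynamics/reward prediction:
    # deterministic transition on a line, reward = -distance to goal.
    next_state = state + action
    reward = -abs(next_state - GOAL)
    return next_state, reward


def plan(state, candidate_actions, horizon=3):
    # Score each first action by simulating short rollouts with the
    # model, then act greedily -- the planning step that lets the agent
    # deviate from an imitated policy's gradual updates.
    def rollout_value(s, depth):
        if depth == 0:
            return 0.0
        return max(
            r + rollout_value(ns, depth - 1)
            for ns, r in (dynamics_model(s, a) for a in candidate_actions)
        )

    scores = {}
    for a in candidate_actions:
        ns, r = dynamics_model(state, a)
        scores[a] = r + rollout_value(ns, horizon - 1)
    return max(scores, key=scores.get)


best = plan(state=0, candidate_actions=[-1, 0, 1], horizon=3)
print(best)  # the planner selects the action that moves toward the goal
```

In DICP the dynamics predictions and the policy come from the same Transformer conditioned on the interaction history, so no separate dynamics model is trained; the sketch only isolates the simulate-then-act idea.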