AI Summary
Transformers in in-context reinforcement learning (ICRL) often inherit suboptimal behaviors from source algorithms or datasets, degrading cross-environment generalization. To address this, we propose Learning History Filtering (LHF), a differentiable preprocessing method that jointly models improvement and stability as a reweighting criterion to filter trajectory data at the source, thereby mitigating inherited suboptimality. Grounded in the weighted empirical risk minimization framework, LHF is architecture-agnostic and integrates seamlessly with mainstream ICRL paradigms, including AD, DPT, and DICP. Experiments on discrete and continuous robotic benchmarks show that LHF significantly improves generalization, is robust to noisy demonstrations, and remains stable across diverse sampling strategies. By enabling principled, differentiable data curation, LHF establishes a new paradigm for trustworthy data preprocessing in ICRL.
Abstract
Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm over a substantial number of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm or dataset. In this work, we address the issue of inherited suboptimality from the perspective of dataset preprocessing. Motivated by the success of weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), which enhances ICRL by reweighting and filtering learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to mitigate source suboptimality via dataset preprocessing, and it can be combined with current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments on well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the advantage of LHF becomes more pronounced in the presence of noisy data, underscoring the importance of filtering learning histories.
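To make the core idea concrete, the sketch below illustrates one way a reweighting-and-filtering criterion over learning histories could look. The specific definitions of improvement (late-minus-early mean return) and stability (negative deviation of return changes), the mixing weight `alpha`, and the softmax reweighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lhf_weights(histories, alpha=0.5, keep_frac=0.5):
    """Hypothetical sketch of learning-history filtering (LHF).

    histories: list of 1-D sequences of per-episode returns produced by a
    source RL algorithm in different pretraining environments.
    Returns normalized weights over histories and the indices kept.
    """
    scores = []
    for returns in histories:
        r = np.asarray(returns, dtype=float)
        half = len(r) // 2
        # Improvement: did late returns exceed early returns? (assumed proxy)
        improvement = r[half:].mean() - r[:half].mean()
        # Stability: penalize erratic learning curves (assumed proxy)
        stability = -np.diff(r).std()
        scores.append(alpha * improvement + (1 - alpha) * stability)
    scores = np.asarray(scores)
    # Softmax reweighting over histories, in the spirit of weighted ERM
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Filter: keep only the top keep_frac of histories by weight
    k = max(1, int(len(histories) * keep_frac))
    keep = sorted(np.argsort(w)[::-1][:k].tolist())
    return w, keep
```

In a pipeline, the returned weights could scale each history's imitation loss (weighted ERM), while the kept indices select which histories reach the transformer at all; a steadily improving history should outrank a noisy, flat one.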