AI Summary
Transformers in in-context reinforcement learning (ICRL) often inherit suboptimal behaviors from source algorithms or datasets, degrading cross-environment generalization. To address this, we propose Learning History Filtering (LHF), a differentiable preprocessing method that jointly models improvement and stability as a reweighting criterion to filter trajectory data at the source, thereby mitigating inherited suboptimality. Grounded in the weighted empirical risk minimization framework, LHF is architecture-agnostic and integrates seamlessly with mainstream ICRL paradigms, including AD, DPT, and DICP. Experiments on discrete and continuous robotic benchmarks show that LHF significantly improves generalization, is robust to noisy demonstrations, and remains stable across diverse sampling strategies. By enabling principled, differentiable data curation, LHF establishes a new paradigm for trustworthy data preprocessing in ICRL.
Abstract
Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm over a substantial number of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm or dataset. In this work, we address the issue of inherited suboptimality from the perspective of dataset preprocessing. Motivated by the success of weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), which enhances ICRL by reweighting and filtering learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to mitigate source suboptimality via dataset preprocessing, and it can be combined with current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments on well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the advantage of LHF becomes more pronounced in the presence of noisy data, underscoring the importance of filtering learning histories.
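To make the core idea concrete, the sketch below illustrates one way a reweighting-and-filtering criterion over learning histories could look. The specific definitions of improvement (late-minus-early mean return) and stability (negative deviation of return changes), the mixing weight `alpha`, and the softmax reweighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lhf_weights(histories, alpha=0.5, keep_frac=0.5):
    """Hypothetical sketch of learning-history filtering (LHF).

    histories: list of 1-D sequences of per-episode returns produced by a
    source RL algorithm in different pretraining environments.
    Returns normalized weights over histories and the indices kept.
    """
    scores = []
    for returns in histories:
        r = np.asarray(returns, dtype=float)
        half = len(r) // 2
        # Improvement: did late returns exceed early returns? (assumed proxy)
        improvement = r[half:].mean() - r[:half].mean()
        # Stability: penalize erratic learning curves (assumed proxy)
        stability = -np.diff(r).std()
        scores.append(alpha * improvement + (1 - alpha) * stability)
    scores = np.asarray(scores)
    # Softmax reweighting over histories, in the spirit of weighted ERM
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Filter: keep only the top keep_frac of histories by weight
    k = max(1, int(len(histories) * keep_frac))
    keep = sorted(np.argsort(w)[::-1][:k].tolist())
    return w, keep
```

In a pipeline, the returned weights could scale each history's imitation loss (weighted ERM), while the kept indices select which histories reach the transformer at all; a steadily improving history should outrank a noisy, flat one.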