🤖 AI Summary
Existing context-based reinforcement learning approaches such as Algorithm Distillation rely on large, carefully curated datasets and generalize brittlely from context, leading to unstable training and high computational cost. To address these limitations, we propose the first integration of *n*-gram induction heads into the Transformer architecture within the Algorithm Distillation framework, enabling efficient in-context RL without weight updates. This mechanism explicitly models local sequential patterns, substantially reducing data requirements while improving training robustness and hyperparameter tolerance. Empirical evaluation across grid-world and pixel-based environments demonstrates that our method matches or surpasses standard Algorithm Distillation, with faster convergence and markedly more stable training. Overall, this work points toward lightweight and robust context-based reinforcement learning.
📝 Abstract
In-context learning allows models like transformers to adapt to new tasks from a few examples without updating their weights, a desirable trait for reinforcement learning (RL). However, existing in-context RL methods, such as Algorithm Distillation (AD), demand large, carefully curated datasets and can be unstable and costly to train due to the transient nature of in-context learning abilities. In this work, we integrate n-gram induction heads into transformers for in-context RL. By incorporating these n-gram attention patterns, we considerably reduce the amount of data required for generalization and ease the training process by making models less sensitive to hyperparameters. Our approach matches, and in some cases surpasses, the performance of AD in both grid-world and pixel-based environments, suggesting that n-gram induction heads could improve the efficiency of in-context RL.
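To make the mechanism concrete, below is a minimal NumPy sketch of the hard attention pattern an n-gram induction head expresses: position t attends to any earlier position whose preceding n tokens match the n-gram ending at t, so the head can copy that position's token as the likely continuation. This is an illustrative toy, not the paper's actual implementation; the function name and the toy sequence are our own.

```python
import numpy as np

def ngram_induction_mask(tokens, n):
    """Binary attention mask for a hard n-gram induction head.

    Position t attends to position j (j <= t) when the n tokens
    immediately preceding j equal the n-gram ending at t, so the
    head can copy tokens[j] as the continuation of that n-gram.
    """
    T = len(tokens)
    mask = np.zeros((T, T), dtype=bool)
    for t in range(n - 1, T):            # need a full n-gram ending at t
        context = tuple(tokens[t - n + 1 : t + 1])
        for j in range(n, t + 1):        # causal: only look back
            if tuple(tokens[j - n : j]) == context:
                mask[t, j] = True
    return mask

# Toy sequence "A B C A B" as ids [0, 1, 2, 0, 1]: with n=2, the bigram
# (A, B) ending at t=4 previously occurred ending at position 1, so
# position 4 attends to position 2 ("C"), the token that followed it.
tokens = [0, 1, 2, 0, 1]
mask = ngram_induction_mask(tokens, n=2)
```

In a real model this pattern is soft (learned via attention weights) rather than a binary mask, but the copy-after-matching-context behavior is the same.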