🤖 AI Summary
This work addresses meta-learning in multi-task structured bandits: the goal is to pretrain a Decision Transformer that adapts rapidly in-context, zero-shot, and outperforms expert demonstrators on unseen test tasks. To overcome the limitations of existing approaches, which either rely on optimal-action labels or fail to surpass demonstrator performance, we propose the first reward-prediction-based pretraining paradigm that requires no supervision from optimal actions. Our method integrates structure-aware context serialization with multi-task meta-training, enabling self-supervised policy optimization from structured bandit trajectories and in-context policy adaptation at test time. Experiments across diverse structured bandit domains demonstrate significant reductions in cumulative regret, consistent superiority over demonstrators, strong cross-task generalization, and robust zero-shot adaptability.
📝 Abstract
We study learning to learn for the multi-task structured bandit problem, where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure, and an algorithm should exploit this shared structure to minimize cumulative regret on an unseen but related test task. We use a transformer as the decision-making algorithm and train it to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer learns to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information, such as access to the optimal arms, or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pretraining approach that trains a transformer network to learn a near-optimal policy in-context. This approach leverages the shared structure across tasks, does not require access to optimal actions, and can outperform the demonstrator. We validate these claims on a wide variety of structured bandit problems, showing that our proposed solution is general and can quickly identify the expected rewards of unseen test tasks to support effective exploration.
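To give a rough intuition for the reward-prediction idea (predict expected rewards of a new task from its own interaction history, then act on those predictions), here is a minimal NumPy sketch. It is a hypothetical stand-in, not the paper's method: a ridge-regression estimator plays the role of the pretrained transformer's in-context reward predictor, and the shared structure across tasks is simply a fixed set of arm features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon, noise = 5, 20, 200, 0.1

# Toy shared structure (an assumption of this sketch): every task uses the
# same arm features; only the hidden parameter theta varies across tasks.
arm_features = rng.normal(size=(n_arms, d))

def run_task(theta, policy, explore_rounds=30):
    """Play one bandit task; return cumulative pseudo-regret."""
    A, b = np.eye(d), np.zeros(d)      # ridge-regression statistics
    means = arm_features @ theta       # true expected rewards (for regret only)
    regret = 0.0
    for t in range(horizon):
        if policy == "random" or t < explore_rounds:
            a = int(rng.integers(n_arms))       # uniform exploration
        else:
            # Reward prediction in place of the pretrained transformer:
            # estimate expected rewards from in-task history, act greedily.
            theta_hat = np.linalg.solve(A, b)
            a = int(np.argmax(arm_features @ theta_hat))
        x = arm_features[a]
        r = x @ theta + noise * rng.normal()    # observed noisy reward
        A += np.outer(x, x)
        b += r * x
        regret += means.max() - means[a]
    return regret

# "Meta-test": unseen tasks drawn from the same shared structure.
test_thetas = [rng.normal(size=d) for _ in range(20)]
pred_regret = float(np.mean([run_task(th, "predict") for th in test_thetas]))
rand_regret = float(np.mean([run_task(th, "random") for th in test_thetas]))
print(pred_regret < 0.5 * rand_regret)  # reward prediction cuts regret sharply
```

The sketch mirrors the abstract's claim in miniature: once expected rewards of an unseen task can be identified quickly from its own trajectory, acting on those predictions incurs far less cumulative regret than the exploratory baseline. The actual work replaces the hand-coded ridge estimator with a transformer pretrained across many training tasks.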