🤖 AI Summary
Behavioral foundation models (BFMs) suffer from poor test-time data efficiency in zero-shot reinforcement learning, relying either on known reward functions or extensive labeled data. To address this, the authors propose OpTI-BFM, a BFM framework that incorporates the Upper Confidence Bound (UCB) principle into task inference. By adopting an optimistic decision rule that explicitly models reward uncertainty, OpTI-BFM identifies unknown tasks solely through online interaction with the environment. Built on the successor-features framework, it integrates UCB-style exploration from linear bandits to enable efficient, low-overhead online policy optimization. On standard zero-shot benchmarks, OpTI-BFM accurately identifies and adapts to unseen reward functions within only a few episodes, drastically reducing dependence on prior knowledge of the reward structure or on labeled demonstrations. This advances the practical deployability of BFMs in real-world settings where reward specifications are unavailable or costly to obtain.
📝 Abstract
Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test time, commonly referred to as zero-shot reinforcement learning (RL). While this process is very efficient in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of the reward, or significant labeling effort. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at https://github.com/ThomasRupf/opti-bfm.
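To make the linear-bandit connection concrete, the sketch below illustrates the general shape of UCB-style task inference over a linear reward model, as in the successor-features setting where rewards are approximately linear in state features. This is a minimal illustration of the underlying principle, not the authors' implementation: the function name, the ridge-regression estimate, and the fixed bonus scale `beta` are all simplifying assumptions.

```python
import numpy as np

def ucb_task_inference(features, rewards, candidates, lam=1.0, beta=0.1):
    """Hypothetical sketch of optimistic task inference with a linear reward model.

    features:   (n, d) array of observed state features phi(s)
    rewards:    (n,)   array of observed rewards, assumed ~ phi(s) @ w
    candidates: (m, d) array of candidate task vectors z
    Returns the candidate with the highest upper confidence bound on reward,
    mirroring LinUCB-style exploration (not the paper's exact decision rule).
    """
    d = features.shape[1]
    # Regularized Gram matrix and ridge estimate of the unknown task vector w.
    A = lam * np.eye(d) + features.T @ features
    w_hat = np.linalg.solve(A, features.T @ rewards)
    A_inv = np.linalg.inv(A)
    # Optimistic score: estimated reward plus an uncertainty bonus z^T A^{-1} z.
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", candidates, A_inv, candidates))
    scores = candidates @ w_hat + beta * bonus
    return candidates[np.argmax(scores)]
```

As more transitions are observed, the Gram matrix `A` grows along the directions the agent has explored, shrinking the bonus there and steering data collection toward task directions that remain uncertain.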