Optimistic Task Inference for Behavior Foundation Models

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Behavior Foundation Models (BFMs) suffer from poor test-time data efficiency in zero-shot reinforcement learning, relying on either known reward functions or extensive labeled data. To address this, we propose OpTI-BFM, the first BFM framework to incorporate the Upper Confidence Bound (UCB) principle into task inference. By adopting an optimistic decision rule that explicitly models reward uncertainty, OpTI-BFM identifies unknown tasks solely through online interaction with the environment. Built on the successor-features framework, it adapts UCB-style exploration from linear bandits to enable efficient, low-overhead online policy optimization. On standard zero-shot benchmarks, OpTI-BFM identifies and adapts to unseen reward functions within only a few episodes, drastically reducing dependence on prior knowledge of the reward structure or on labeled demonstrations. This advances the practical deployability of BFMs in real-world settings where reward specifications are unavailable or costly to obtain.

📝 Abstract
Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at https://github.com/ThomasRupf/opti-bfm.
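The connection to upper-confidence algorithms for linear bandits can be sketched as follows: if rewards are assumed linear in a state-feature map, a ridge-regression estimate of the reward weights plus an ellipsoidal exploration bonus scores candidate task embeddings by their successor features. This is a minimal illustrative sketch under that linearity assumption; the class and method names (`OptimisticTaskInference`, `ucb_score`, etc.) are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

class OptimisticTaskInference:
    """LinUCB-style sketch of optimistic task inference.

    Assumes rewards are linear in a state feature vector phi, i.e.
    r(s) = phi(s) @ w for unknown weights w, and that each candidate
    task embedding comes with successor features psi (the expected
    discounted sum of phi under that task's policy).
    """

    def __init__(self, dim, lam=1.0, beta=1.0):
        self.A = lam * np.eye(dim)  # regularized Gram matrix of observed features
        self.b = np.zeros(dim)      # reward-weighted feature sum
        self.beta = beta            # confidence-width multiplier

    def update(self, phi, reward):
        # Online ridge-regression update from one observed (feature, reward) pair.
        self.A += np.outer(phi, phi)
        self.b += reward * phi

    def w_hat(self):
        # Current ridge estimate of the reward weights.
        return np.linalg.solve(self.A, self.b)

    def ucb_score(self, psi):
        # Optimistic value of a candidate task: mean value estimate
        # plus an exploration bonus given by the ellipsoidal norm
        # ||psi||_{A^{-1}}, as in linear-bandit UCB.
        bonus = self.beta * np.sqrt(psi @ np.linalg.solve(self.A, psi))
        return self.w_hat() @ psi + bonus

    def select_task(self, candidate_psis):
        # Act optimistically: pick the candidate with the highest UCB.
        return int(np.argmax([self.ucb_score(p) for p in candidate_psis]))
```

In this sketch, uncertainty about the reward function shrinks along directions the agent has already observed, so the bonus steers data collection toward informative states, mirroring the paper's optimistic decision criterion.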
Problem

Research questions and friction points this paper is trying to address.

Optimizing task inference through environment interaction without reward labels
Reducing data dependency for Behavior Foundation Models during test-time
Enabling efficient reward function identification with minimal computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic task inference through environment interaction
Models uncertainty over reward functions directly
Minimal compute overhead for unseen reward optimization
Thomas Rupf
ETH Zürich, Switzerland
Marco Bagatella
ETH Zürich, Switzerland and Max Planck Institute for Intelligent Systems, Germany
Marin Vlastelica
Postdoctoral Fellow @ ETH AI Center
machine learning, reinforcement learning, generative modelling
Andreas Krause
ETH Zürich, Switzerland