🤖 AI Summary
Behavioral foundation models (BFMs) suffer from poor test-time data efficiency in zero-shot reinforcement learning, relying either on known reward functions or extensive labeled data. To address this, the authors propose OpTI-BFM, a BFM framework that incorporates the Upper Confidence Bound (UCB) principle into task inference. By adopting an optimistic decision rule that explicitly models reward uncertainty, OpTI-BFM identifies unknown tasks solely through online interaction with the environment. Built on the successor-features framework, it integrates UCB-style exploration from linear bandits to enable efficient, low-overhead online policy optimization. On standard zero-shot benchmarks, OpTI-BFM accurately identifies and adapts to unseen reward functions within only a few episodes, drastically reducing dependence on prior knowledge of the reward structure or on labeled demonstrations. This advances the practical deployability of BFMs in real-world settings where reward specifications are unavailable or costly to obtain.
📝 Abstract
Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test time, commonly referred to as zero-shot reinforcement learning (RL). While this process is very efficient in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of the reward, or significant labeling effort. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at https://github.com/ThomasRupf/opti-bfm.
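To make the linear-bandit connection concrete, the sketch below illustrates the general shape of UCB-style task inference over a linear reward model, as in the successor-features setting where rewards are approximately linear in state features. This is a minimal illustration of the underlying principle, not the authors' implementation: the function name, the ridge-regression estimate, and the fixed bonus scale `beta` are all simplifying assumptions.

```python
import numpy as np

def ucb_task_inference(features, rewards, candidates, lam=1.0, beta=0.1):
    """Hypothetical sketch of optimistic task inference with a linear reward model.

    features:   (n, d) array of observed state features phi(s)
    rewards:    (n,)   array of observed rewards, assumed ~ phi(s) @ w
    candidates: (m, d) array of candidate task vectors z
    Returns the candidate with the highest upper confidence bound on reward,
    mirroring LinUCB-style exploration (not the paper's exact decision rule).
    """
    d = features.shape[1]
    # Regularized Gram matrix and ridge estimate of the unknown task vector w.
    A = lam * np.eye(d) + features.T @ features
    w_hat = np.linalg.solve(A, features.T @ rewards)
    A_inv = np.linalg.inv(A)
    # Optimistic score: estimated reward plus an uncertainty bonus z^T A^{-1} z.
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", candidates, A_inv, candidates))
    scores = candidates @ w_hat + beta * bonus
    return candidates[np.argmax(scores)]
```

As more transitions are observed, the Gram matrix `A` grows along the directions the agent has explored, shrinking the bonus there and steering data collection toward task directions that remain uncertain.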