🤖 AI Summary
In safety-critical ad hoc teamwork settings, existing offline reinforcement learning (RL) methods can learn high-performance policies but lack statistical safety guarantees against predefined undesirable behaviors (e.g., harming teammates).
Method: This paper introduces Seldonian optimization into the offline RL framework (claimed as the first such integration), proposing a purely offline, model-free, safety-aware policy selection method. It requires no online interaction, imposes no structural assumptions on the policy class, and avoids exact modeling of other agents' policies. Using only offline data, it combines conservative policy evaluation, confidence-interval estimation for the safety constraints, and candidate policy search to enforce verifiable behavioral safety (e.g., a statistically grounded lower bound on the probability of causing zero harm) while preserving performance.
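The Seldonian selection loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the safety property is a per-episode indicator in [0, 1] estimated from the offline dataset, and it uses a Hoeffding bound as a stand-in for whatever concentration inequality the authors use to build the constraint confidence interval. All names (`hoeffding_lower_bound`, `seldonian_select`, `perf_estimate`, `safety_samples`) are hypothetical.

```python
import math

def hoeffding_lower_bound(samples, delta):
    # One-sided Hoeffding lower bound on the mean of [0, 1]-bounded samples,
    # valid with probability at least 1 - delta. (Stand-in for the paper's
    # confidence-interval construction.)
    n = len(samples)
    mean = sum(samples) / n
    return mean - math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def seldonian_select(candidates, perf_estimate, safety_samples, threshold, delta):
    """Return the highest-performing candidate policy whose safety lower
    bound clears `threshold`; return None ("No Solution Found") otherwise.

    perf_estimate(pi)  -> conservative off-policy performance estimate
    safety_samples(pi) -> per-episode safety indicators from offline data
    """
    passing = [
        pi for pi in candidates
        if hoeffding_lower_bound(safety_samples(pi), delta) >= threshold
    ]
    if not passing:
        return None  # Seldonian behavior: refuse rather than risk unsafety
    return max(passing, key=perf_estimate)
```

The key Seldonian design choice is the `None` branch: when no candidate can be certified safe at confidence level 1 - delta, the algorithm declines to return a policy instead of silently returning an unverified one.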
Results: On standard ad hoc teamwork benchmarks, the method consistently yields high-quality policies that satisfy stringent safety constraints, with significantly higher sample efficiency than leading offline RL baselines.
📝 Abstract
Most offline RL algorithms return optimal policies but do not provide statistical guarantees on undesirable behaviors. This can cause reliability issues in safety-critical applications, such as multiagent domains where agents, and possibly humans, must interact to reach their goals without harming each other. In this work, we propose a novel offline RL approach, inspired by Seldonian optimization, which returns policies with good performance and statistically guaranteed properties with respect to predefined undesirable behaviors. In particular, our focus is on Ad Hoc Teamwork settings, where agents must collaborate with new teammates without prior coordination. Our method requires only a pre-collected dataset, a set of candidate policies for our agent, and a specification of the possible policies followed by the other players -- it does not require further interactions, training, or assumptions on the type and architecture of the policies. We test our algorithm on Ad Hoc Teamwork problems and show that it consistently finds reliable policies while improving sample efficiency with respect to standard ML baselines.