🤖 AI Summary
This work addresses trust calibration in autonomous agents: how an agent should decide, action by action, whether a proposed tool call may execute independently or must first receive human approval. The paper formalizes this as a preference-learning task. It introduces a policy gateway that maintains a Gaussian-process posterior over the human's latent risk-tolerance function, using a probit likelihood and approximate Gaussian-process classification to infer preferences from binary approve/deny feedback. The gateway queries the human where the approval outcome is most uncertain, which yields a three-region decision mechanism: "allow," "block," and "ask." The approach is structurally an instance of Preferential Bayesian Optimization and inherits its sample-efficiency argument: by targeting queries at uncertain actions, the gateway partitions the action space while reducing unnecessary human interventions.
📝 Abstract
We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.
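The allow/block/ask rule described in the abstract can be illustrated with a minimal sketch. It assumes the Gaussian-process posterior over the latent risk-tolerance value at an action has already been computed and is summarized by a mean and a variance; under a probit likelihood, the predictive approval probability is then Φ(mean / √(1 + variance)). The `margin` threshold, the function names, and the example tool names are hypothetical and do not come from the paper.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approval_probability(mean: float, var: float) -> float:
    """Predictive P(approve) for a probit likelihood when the latent
    risk-tolerance value has Gaussian posterior N(mean, var)."""
    return normal_cdf(mean / math.sqrt(1.0 + var))


def gate(mean: float, var: float, margin: float = 0.1) -> str:
    """Three-region decision. `margin` is an assumed threshold:
    allow if P(approve) >= 1 - margin, block if <= margin, else ask."""
    p = approval_probability(mean, var)
    if p >= 1.0 - margin:
        return "allow"
    if p <= margin:
        return "block"
    return "ask"


def most_uncertain(candidates: dict) -> str:
    """Pick the action whose approval probability is closest to 0.5.
    `candidates` maps an action name to its (mean, var) posterior summary."""
    return min(
        candidates,
        key=lambda a: abs(approval_probability(*candidates[a]) - 0.5),
    )


# Hypothetical posterior summaries for three tool calls.
posterior = {
    "read_file": (3.0, 0.1),
    "delete_branch": (0.2, 1.0),
    "drop_table": (-3.0, 0.1),
}
decisions = {name: gate(m, v) for name, (m, v) in posterior.items()}
next_query = most_uncertain(posterior)
```

Note that a large posterior variance pulls the approval probability toward 0.5, so an action with a favorable mean but little supporting feedback still lands in the "ask" region rather than "allow".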