"You just can't go around killing people"Explaining Agent Behavior to a Human Terminator

📅 2025-04-06

📈 Citations: 0

✨ Influential: 0

career value

205K/year

🤖 AI Summary

This work addresses the decision-optimization problem of dynamic human takeover from pre-trained autonomous agents in human–machine collaboration, aiming to balance takeover frequency and system safety. We propose a “controllable takeover” interaction paradigm, whose core innovation is the first formal modeling of takeover intent—integrating counterfactual reasoning with policy confidence estimation to jointly determine both the necessity and optimal timing of takeover in an interpretable manner, while supporting online confidence calibration. Evaluated in simulated autonomous driving and medical assistance tasks, our approach reduces spurious takeovers by 37% and improves task success rate by 12%, thereby significantly enhancing human–machine trust and collaborative efficiency.

Technology Category

Application Category

📝 Abstract

Consider a setting where a pre-trained agent is operating in an environment and a human operator can decide to temporarily terminate its operation and take-over for some duration of time. These kind of scenarios are common in human-machine interactions, for example in autonomous driving, factory automation and healthcare. In these settings, we typically observe a trade-off between two extreme cases -- if no take-overs are allowed, then the agent might employ a sub-optimal, possibly dangerous policy. Alternatively, if there are too many take-overs, then the human has no confidence in the agent, greatly limiting its usefulness. In this paper, we formalize this setup and propose an explainability scheme to help optimize the number of human interventions.

Problem

Research questions and friction points this paper is trying to address.

Balancing human take-overs and agent autonomy

Optimizing human-agent interaction in critical scenarios

Reducing dangerous policies while maintaining agent usefulness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained agent with human take-over option

Explainability scheme to optimize interventions

Balances safety and agent autonomy

🔎 Similar Papers

Policy Learning with a Language Bottleneck