🤖 AI Summary
In multi-turn interactive settings, accumulated uncertainty puts LLM agents at risk of catastrophic failures, especially during tool invocation and sequential reasoning.
Method: This paper proposes "quitting" as an active-exit safety mechanism: agents autonomously terminate execution upon detecting low confidence, thereby preemptively avoiding high-risk outcomes. It is the first work to systematically integrate selective exit into the LLM agent safety framework.
Contribution/Results: Evaluated across 12 state-of-the-art models in the ToolEmu environment, explicit quit instructions yield an average safety improvement of +0.39 on a 0-3 scale (up to +0.64 for proprietary models), while helpfulness declines only marginally (−0.03), a highly favorable safety-utility trade-off. The approach provides a scalable, dynamic first line of safety defense for autonomous agents operating in high-stakes scenarios.
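The mechanism is behavioral rather than architectural: the agent's prompt is extended with an explicit quit instruction, and the control loop treats a quit response as a safe terminal state instead of forcing a tool call. The following is a minimal sketch of that idea; the instruction wording, function names, and action format here are illustrative assumptions, not the paper's or ToolEmu's actual API.

```python
# Hypothetical quit-aware agent loop (names and action format are assumptions).

QUIT_INSTRUCTION = (
    "If you are uncertain about the user's intent or the safety of the next "
    "tool call, respond with the single action QUIT instead of proceeding."
)

def run_agent(task, llm, tools, max_turns=10):
    """Run a tool-using agent that may terminate itself via a QUIT action.

    llm:   callable taking the prompt so far and returning an action string,
           e.g. "search(weather in Paris)" or "QUIT".
    tools: dict mapping tool names to callables taking an argument string.
    """
    history = [f"{QUIT_INSTRUCTION}\nTask: {task}"]
    for _ in range(max_turns):
        action = llm("\n".join(history))
        if action.strip() == "QUIT":
            # Safe withdrawal: stop before executing a risky tool call.
            return {"status": "quit", "history": history}
        name, _, args = action.partition("(")
        observation = tools[name](args.rstrip(")"))
        history.append(f"Action: {action}\nObservation: {observation}")
    return {"status": "max_turns", "history": history}
```

In this framing, quitting needs no model retraining or external guardrail: the safety check lives entirely in the prompt and a single string comparison in the loop, which is why it can be dropped into existing agent systems immediately.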
📝 Abstract
As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while incurring a negligible average decrease of 0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.