Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-turn interactive settings, accumulated uncertainty puts LLM agents at risk of catastrophic failures, especially during tool invocation and sequential reasoning. Method: The paper proposes "quitting" as a behavioral safety mechanism: agents autonomously terminate execution when they detect low confidence, preemptively avoiding high-risk outcomes. This is the first systematic integration of selective quitting into the LLM agent safety framework. Contribution/Results: Evaluated across 12 state-of-the-art models in the ToolEmu environment, explicit quit instructions yield an average safety improvement of +0.39 on a 0-3 scale (+0.64 for proprietary models), while helpfulness declines only marginally (-0.03), a highly favorable safety-helpfulness trade-off. The approach provides a scalable, immediately deployable first-line safety defense for autonomous agents in high-stakes scenarios.
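The mechanism is simple enough to sketch end to end. Below is a minimal Python illustration of an agent loop with an explicit quit instruction, assuming a generic chat-completion client; the `call_llm` helper, the QUIT token, and the "tool arg" reply protocol are hypothetical stand-ins, not the paper's actual prompts or code.

```python
# A minimal sketch of an agent loop with an explicit quit instruction.
# `call_llm`, the QUIT token, and the reply protocol are hypothetical
# stand-ins, not the paper's actual prompts or API.

QUIT_INSTRUCTION = (
    "If you are uncertain about the user's intent or the safety of the "
    "next tool call, reply with the single word QUIT instead of acting."
)

def run_agent(task, tools, call_llm, max_turns=10):
    """Tool-using agent loop that withdraws when the model signals low confidence."""
    messages = [
        {"role": "system",
         "content": "You are a tool-using agent. " + QUIT_INSTRUCTION},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = call_llm(messages)  # e.g. a thin wrapper around a chat API
        messages.append({"role": "assistant", "content": reply})
        if reply.strip() == "QUIT":
            return {"status": "quit", "trace": messages}  # safe withdrawal
        if reply.startswith("FINAL:"):
            return {"status": "done",
                    "answer": reply[len("FINAL:"):].strip(),
                    "trace": messages}
        tool_name, _, tool_arg = reply.partition(" ")  # naive "tool arg" protocol
        tool = tools.get(tool_name)
        result = tool(tool_arg) if tool else f"unknown tool {tool_name!r}"
        messages.append({"role": "user", "content": f"Observation: {result}"})
    return {"status": "max_turns", "trace": messages}
```

The design point the sketch captures is that quitting needs no new infrastructure: one added instruction and one string check make withdrawal a first-class action the agent can take at any turn.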

📝 Abstract
As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.
Problem

Research questions and friction points this paper is trying to address.

Improving safety of LLM agents in complex environments
Addressing uncertainty challenges in multi-turn agent scenarios
Implementing selective quitting as a safety mechanism for agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quitting mechanism improves LLM agent safety
Explicit quit instructions improve the safety-helpfulness trade-off
ToolEmu framework evaluates quitting across 12 state-of-the-art models (see the scoring sketch below)
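For intuition about how the headline deltas are computed, here is a hedged scoring sketch that averages per-model safety and helpfulness changes between baseline and quit-instructed ToolEmu runs. The model names and score values are made-up placeholders; only the 0-3 scale comes from the paper.

```python
# Hypothetical per-model ToolEmu scores on the paper's 0-3 scale.
# These numbers are illustrative placeholders, not the paper's results.
baseline = {
    "model_a": {"safety": 1.8, "helpfulness": 2.1},
    "model_b": {"safety": 2.0, "helpfulness": 2.3},
}
with_quit = {
    "model_a": {"safety": 2.4, "helpfulness": 2.1},
    "model_b": {"safety": 2.3, "helpfulness": 2.2},
}

def mean_delta(metric: str) -> float:
    """Average (quit - baseline) score change across models."""
    deltas = [with_quit[m][metric] - baseline[m][metric] for m in baseline]
    return sum(deltas) / len(deltas)

print(f"avg safety delta:      {mean_delta('safety'):+.2f}")
print(f"avg helpfulness delta: {mean_delta('helpfulness'):+.2f}")
```

A positive mean safety delta paired with a near-zero helpfulness delta is exactly the trade-off profile the paper reports.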
Vamshi Krishna Bonagiri
Undergraduate Researcher, Precog, IIIT Hyderabad
Machine Learning, Natural Language Processing, HCI, LLMs
Ponnurangam Kumaraguru
International Institute of Information Technology, Hyderabad (IIIT Hyderabad)
Khanh Nguyen
University of California, Berkeley
Benjamin Plaut
University of California, Berkeley
AI safety, Economics and computation