🤖 AI Summary
In multi-turn interactive settings, accumulated uncertainty puts LLM agents at risk of catastrophic failures, especially during tool invocation and sequential reasoning.
Method: This paper proposes "quitting" as an active-exit safety mechanism: agents autonomously terminate execution upon detecting low confidence, thereby preemptively avoiding high-risk outcomes. It is the first work to systematically integrate selective exit into the LLM agent safety framework.
Contribution/Results: Evaluated across 12 state-of-the-art models in the ToolEmu environment, explicit quit instructions yield an average safety improvement of +0.39 on a 0-3 scale (up to +0.64 for proprietary models), while helpfulness declines only marginally (−0.03), a highly favorable safety-utility trade-off. The approach provides a scalable, dynamic first line of safety defense for autonomous agents operating in high-stakes scenarios.
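The mechanism is behavioral rather than architectural: the agent's prompt is extended with an explicit quit instruction, and the control loop treats a quit response as a safe terminal state instead of forcing a tool call. The following is a minimal sketch of that idea; the instruction wording, function names, and action format here are illustrative assumptions, not the paper's or ToolEmu's actual API.

```python
# Hypothetical quit-aware agent loop (names and action format are assumptions).

QUIT_INSTRUCTION = (
    "If you are uncertain about the user's intent or the safety of the next "
    "tool call, respond with the single action QUIT instead of proceeding."
)

def run_agent(task, llm, tools, max_turns=10):
    """Run a tool-using agent that may terminate itself via a QUIT action.

    llm:   callable taking the prompt so far and returning an action string,
           e.g. "search(weather in Paris)" or "QUIT".
    tools: dict mapping tool names to callables taking an argument string.
    """
    history = [f"{QUIT_INSTRUCTION}\nTask: {task}"]
    for _ in range(max_turns):
        action = llm("\n".join(history))
        if action.strip() == "QUIT":
            # Safe withdrawal: stop before executing a risky tool call.
            return {"status": "quit", "history": history}
        name, _, args = action.partition("(")
        observation = tools[name](args.rstrip(")"))
        history.append(f"Action: {action}\nObservation: {observation}")
    return {"status": "max_turns", "history": history}
```

In this framing, quitting needs no model retraining or external guardrail: the safety check lives entirely in the prompt and a single string comparison in the loop, which is why it can be dropped into existing agent systems immediately.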
📝 Abstract
As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while incurring a negligible average decrease of 0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.