The Autonomy Tax: Defense Training Breaks LLM Agents

📅 2026-03-19
🤖 AI Summary
This work systematically evaluates the impact of defense training on large language model (LLM) agents, revealing a critical trade-off between safety alignment and functional capability. Through a comprehensive assessment across 97 multi-step tasks and 1,000 adversarial prompts, the study identifies a "capability–alignment paradox": defense training intended to enhance security severely degrades agents' task execution while leaving them vulnerable to sophisticated attacks. The authors uncover three agent-specific biases (agent incompetence bias, cascade amplification bias, and trigger bias) and show that defended models time out on 99% of tasks, versus 13% for undefended baselines. Large-scale adversarial evaluation, root-cause analysis, tool-use logs, and retry-behavior tracking further show that most attacks readily bypass existing defenses, indicating that current approaches sacrifice practical utility without delivering meaningful security guarantees.

📝 Abstract
Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability–alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool-execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. Cascade amplification bias causes early failures to propagate through retry loops, pushing defended models to time out on 99% of tasks compared to 13% for baselines. Trigger bias leads to paradoxical security degradation, where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root-cause analysis reveals that these biases stem from shortcut learning: models overfit to surface attack patterns rather than developing a semantic understanding of threats, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool-execution competence under adversarial conditions.
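The cascade amplification effect described in the abstract can be illustrated with a toy simulation (not the paper's actual evaluation harness; the per-action invalid-rate values and the action budget below are hypothetical). Each task requires a fixed number of valid tool actions; an invalid action triggers a retry, and retries consume the same finite action budget, so a modestly higher per-action failure rate compounds across steps into a much higher timeout rate:

```python
import random

def run_task(steps, p_invalid, max_actions):
    """Simulate one multi-step agent task: each step retries until a
    valid tool action is produced or the action budget (timeout) runs out."""
    actions_used = 0
    for _ in range(steps):
        while True:
            actions_used += 1
            if actions_used > max_actions:
                return "timeout"  # retry loop exhausted the budget
            if random.random() > p_invalid:
                break  # valid action; advance to the next step
    return "success"

def timeout_rate(p_invalid, tasks=2000, steps=10, max_actions=30):
    """Fraction of simulated tasks that hit the action budget."""
    random.seed(0)
    runs = [run_task(steps, p_invalid, max_actions) for _ in range(tasks)]
    return runs.count("timeout") / tasks

# A higher per-action invalid rate compounds across steps:
print(timeout_rate(0.05))  # baseline-like agent: timeouts are rare
print(timeout_rate(0.70))  # defended-like agent: most tasks time out
```

The point of the sketch is that the gap in timeout rates is far larger than the gap in per-action failure rates: retries do not recover capability, they just burn the budget, which matches the paper's description of early failures propagating through retry loops.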
Problem

Research questions and friction points this paper is trying to address.

Autonomy Tax
LLM agents
defense training
capability-alignment paradox
prompt injection attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

capability-alignment paradox
defense training
multi-step LLM agents
shortcut learning
adversarial robustness