AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of large language model agents in handling long-range dependencies and tool use in unfamiliar environments by introducing an escape-room-style evaluation benchmark that explicitly models such dependencies. The benchmark encodes multi-step dependencies among tools and items as directed acyclic graphs, requiring agents to invoke real external functions, track hidden state that is revealed incrementally, propagate intermediate results, and submit deterministically verifiable answers. Experiments on 270 tasks spanning five difficulty tiers show that success rates for both humans and the best model drop sharply as dependency depth increases (humans from 98.3% to 80.0%, the best model from 90.0% to 60.0%), with model failures traced mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. The results support the benchmark's use as a diagnostic testbed for tool-grounded reasoning in LLM agents.
📝 Abstract
As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.
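The task structure described in the abstract (a directed acyclic dependency graph over tools, intermediate results passed forward, and a deterministically checked final answer) can be illustrated with a minimal sketch. Everything below is hypothetical: the tool names, the toy puzzle, the `dependency_depth` helper, and the expected answer are assumptions made for illustration and are not taken from AgentEscapeBench. The runner executes tools in a valid order like an oracle would; it does not model an LLM agent or hidden state.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy task: each tool maps to the set of tools whose outputs it needs.
DEPENDENCIES = {
    "read_note": set(),
    "open_drawer": {"read_note"},
    "decode_cipher": {"read_note", "open_drawer"},
    "unlock_door": {"decode_cipher"},
}

# Hypothetical tool implementations. Each receives a dict holding the
# outputs of its prerequisites, keyed by tool name.
def read_note(inputs):
    return 7

def open_drawer(inputs):
    return inputs["read_note"] * 3

def decode_cipher(inputs):
    return inputs["read_note"] + inputs["open_drawer"]

def unlock_door(inputs):
    return f"CODE-{inputs['decode_cipher']}"

TOOLS = {
    "read_note": read_note,
    "open_drawer": open_drawer,
    "decode_cipher": decode_cipher,
    "unlock_door": unlock_door,
}

def dependency_depth(deps):
    """Number of nodes on the longest prerequisite chain in the graph."""
    depth = {}
    for node in TopologicalSorter(deps).static_order():
        depth[node] = 1 + max((depth[p] for p in deps[node]), default=0)
    return max(depth.values())

def run_task(deps, tools):
    """Call every tool after its prerequisites, propagating intermediate results."""
    state = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = {p: state[p] for p in deps[name]}
        state[name] = tools[name](inputs)
    return state

def verify(answer, expected="CODE-28"):
    """Deterministic check of the final answer (expected value is an assumption)."""
    return answer == expected

if __name__ == "__main__":
    results = run_task(DEPENDENCIES, TOOLS)
    print(dependency_depth(DEPENDENCIES), verify(results["unlock_door"]))
```

In this toy graph a single wrong or dropped intermediate value (for example, the output of `open_drawer`) changes the final code, which is the kind of long-range propagation failure the abstract attributes to current agents.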
Problem

Research questions and friction points this paper is trying to address.

tool-grounded reasoning
out-of-domain generalization
long-range dependencies
LLM agents
escape-room benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-grounded reasoning
long-range dependency
escape-room benchmark
LLM agents
state tracking