UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks focus on short-horizon, fully observable tasks, failing to assess agents’ sustained reasoning, planning, memory, and tool orchestration in long-horizon, partially observable real-world settings (e.g., software development, scientific discovery). Method: We introduce UltraHorizon—the first ultra-long-horizon benchmark supporting trajectory evaluation exceeding 200K tokens—featuring three high-complexity simulated environments where agents progressively discover implicit rules. We systematically evaluate multi-step reasoning, dynamic planning, long-term memory retention, and tool usage in concert. Results: Our experiments reveal substantial performance gaps in current LLM-based agents; scaling model size alone does not close these gaps. Core failures stem from context locking and intrinsic functional limitations. This work establishes the necessity of evaluating long-horizon cognitive capabilities and provides a foundational assessment paradigm for next-generation autonomous agents.

📝 Abstract
Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and fundamental functional capability gaps. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
Problem

Research questions and friction points this paper is trying to address.

Evaluating agent capabilities in ultra long-horizon scenarios with partial observability
Measuring sustained reasoning, planning, memory management and tool use abilities
Benchmarking performance in complex real-world tasks requiring iterative discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for ultra long-horizon agent evaluation
Uses exploration tasks with hidden rule discovery
Tests reasoning, planning, memory and tool management
Haotian Luo
Didichuxing Co. Ltd
Huaisong Zhang
Tsinghua University
Xuelin Zhang
Didichuxing Co. Ltd
Haoyu Wang
Tsinghua University
Zeyu Qin
Hong Kong University of Science and Technology
Machine Learning, Deep Learning, Scalable Oversight, AI Safety
Wenjie Lu
Didichuxing Co. Ltd
Guozheng Ma
Nanyang Technological University
Reinforcement Learning, Deep Learning
Haiying He
China Agricultural University
LLM, MLLM, Agent
Yingsha Xie
Sun Yat-sen University
Qiyang Zhou
Sun Yat-sen University
Zixuan Hu
Nanyang Technological University
Hongze Mi
Tianjin University
Yibo Wang
Tsinghua University
Naiqiang Tan
Didichuxing Co. Ltd
Hong Chen
Huazhong Agricultural University
Yi R. Fung
HKUST
Chun Yuan
Tsinghua University
Li Shen
Sun Yat-sen University