UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks focus on short-horizon, fully observable tasks, failing to assess agents’ sustained reasoning, planning, memory, and tool orchestration in long-horizon, partially observable real-world settings (e.g., software development, scientific discovery). Method: We introduce UltraHorizon—the first ultra-long-horizon benchmark supporting trajectory evaluation exceeding 200K tokens—featuring three high-complexity simulated environments where agents progressively discover implicit rules. We systematically evaluate multi-step reasoning, dynamic planning, long-term memory retention, and tool usage in concert. Results: Our experiments reveal substantial performance gaps in current LLM-based agents; scaling model size alone does not close these gaps. Core failures stem from context locking and intrinsic functional limitations. This work establishes the necessity of evaluating long-horizon cognitive capabilities and provides a foundational assessment paradigm for next-generation autonomous agents.

📝 Abstract
Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and fundamental functional capability gaps. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
Problem

Research questions and friction points this paper is trying to address.

Evaluating agent capabilities in ultra long-horizon scenarios with partial observability
Measuring sustained reasoning, planning, memory management and tool use abilities
Benchmarking performance in complex real-world tasks requiring iterative discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for ultra long-horizon agent evaluation
Uses exploration tasks with hidden rule discovery
Tests reasoning, planning, memory and tool management
Haotian Luo
Didichuxing Co. Ltd
Huaisong Zhang
Tsinghua University
Xuelin Zhang
Didichuxing Co. Ltd
Haoyu Wang
Tsinghua University
Zeyu Qin
Hong Kong University of Science and Technology
Machine Learning, Deep Learning, Scalable Oversight, AI Safety
Wenjie Lu
Didichuxing Co. Ltd
Guozheng Ma
Nanyang Technological University
Reinforcement Learning, Deep Learning
Haiying He
China Agricultural University
LLM, MLLM, Agent
Yingsha Xie
Sun Yat-sen University
Qiyang Zhou
Sun Yat-sen University
Zixuan Hu
Nanyang Technological University
Hongze Mi
Tianjin University
Yibo Wang
Tsinghua University
Naiqiang Tan
Didichuxing Co. Ltd
Hong Chen
Huazhong Agricultural University
Yi R. Fung
HKUST
Chun Yuan
Tsinghua University
Li Shen
Sun Yat-sen University