The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation of large language model (LLM) agents on long-horizon tasks, a problem that has been hard to study without systematic diagnostic frameworks or cross-domain evaluation methods. To bridge this gap, the authors introduce HORIZON, an initial cross-domain diagnostic benchmark for constructing long-horizon tasks and analyzing agent failures, built on more than 3,100 execution trajectories across four agentic domains. They also propose a scalable, trajectory-grounded LLM-as-a-Judge pipeline for failure attribution and validate it against human annotation: the judge agrees with human labels at κ = 0.84, while agreement between human annotators is κ = 0.61. The study empirically characterizes how agent performance decays as task horizons extend and offers practical guidance for building more reliable long-horizon agents.
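As a rough illustration of what characterizing performance decay over task horizons can look like, the sketch below groups trajectories by horizon length and reports the success rate per group. The function name `success_rate_by_horizon`, the `(horizon, succeeded)` record shape, and the sample runs are all hypothetical; none of them come from HORIZON or its released data.

```python
from collections import defaultdict


def success_rate_by_horizon(trajectories):
    """Return {horizon: success rate} for (horizon, succeeded) records."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for horizon, succeeded in trajectories:
        totals[horizon] += 1
        wins[horizon] += int(succeeded)
    # Sort by horizon so the decay trend reads from short to long tasks.
    return {h: wins[h] / totals[h] for h in sorted(totals)}


# Made-up runs for illustration only: (number of required steps, success flag).
runs = [(5, True), (5, True), (20, True), (20, False), (50, False), (50, False)]
rates = success_rate_by_horizon(runs)
```

A real analysis would likely bucket horizons into ranges and report confidence intervals, since per-length sample sizes can be small.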

📝 Abstract

Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator κ=0.61; human-judge κ=0.84). Our findings offer an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures, along with practical guidance for building more reliable long-horizon agents. We release our project website, the HORIZON Leaderboard, at https://xwang2775.github.io/horizon-leaderboard/ and welcome contributions from the community.
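To make the agreement figures easier to interpret, here is a minimal sketch of how a κ statistic between two labelers can be computed. It assumes κ refers to Cohen's kappa for two raters, which the abstract does not spell out. The failure categories (`planning`, `memory`, `tool`) and the six labels are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical failure labels for six trajectories (not data from the paper).
human = ["planning", "memory", "planning", "tool", "memory", "planning"]
judge = ["planning", "memory", "tool", "tool", "memory", "planning"]
kappa = cohens_kappa(human, judge)
```

In this toy example the two raters agree on five of six items and chance agreement is one third, so κ works out to 0.75. The statistic discounts agreement that would occur by chance, which is why it is lower than the raw agreement rate.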

Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks

agentic systems

failure diagnosis

LLM agents

cross-domain benchmark
