🤖 AI Summary
Existing benchmarks predominantly focus on short-horizon, atomic tasks, failing to assess LLM-based agents’ capabilities in realistic office settings: long-horizon, cross-application, multi-step workflows with persistent contextual dependencies. To address this gap, OdysseyBench is proposed as the first systematic benchmark for complex office workflows. It comprises two complementary task splits and is paired with HomerAgents, a multi-agent framework for environment exploration, automated task generation, and dialogue synthesis, which enables the modeling of extended interaction histories and fully automated, scalable benchmark construction. OdysseyBench significantly increases evaluation rigor while preserving ecological validity, offering a more accurate, holistic assessment of state-of-the-art LLM agents’ productivity-oriented capabilities and filling a critical gap in agent evaluation for enterprise-scale intelligent assistants.
📝 Abstract
Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agents to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts than existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research in this direction.
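The abstract describes HomerAgents as a three-stage pipeline: environment exploration, task generation, and dialogue synthesis. A minimal sketch of that flow is given below; all class and function names are hypothetical illustrations of the stated stages, not the paper's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three HomerAgents stages described in the
# abstract. Names and data shapes here are illustrative assumptions.

@dataclass
class Task:
    apps: list[str]                      # applications the task spans
    instruction: str                     # composed cross-application instruction
    dialogue: list[str] = field(default_factory=list)  # interaction history

def explore_environment(apps: list[str]) -> list[str]:
    """Stage 1: enumerate plausible operations in each office application."""
    return [f"{app}: open/edit/save a document" for app in apps]

def generate_task(operations: list[str]) -> Task:
    """Stage 2: compose explored operations into one multi-step task."""
    apps = [op.split(":")[0] for op in operations]
    instruction = " then ".join(op.split(": ")[1] for op in operations)
    return Task(apps=apps, instruction=instruction)

def synthesize_dialogue(task: Task, filler_turns: int = 3) -> Task:
    """Stage 3: embed the task in a long-horizon dialogue, so the agent
    must recover the essential context from earlier turns."""
    task.dialogue = [f"turn {i}: filler context" for i in range(filler_turns)]
    task.dialogue.append(f"final request: {task.instruction}")
    return task

ops = explore_environment(["Word", "Email"])
task = synthesize_dialogue(generate_task(ops))
print(len(task.dialogue))  # prints 4: three filler turns plus the final request
```

The key design point the abstract emphasizes is the last stage: the task instruction is buried inside a longer interaction history, which is what distinguishes these long-horizon tasks from self-contained atomic benchmarks.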