🤖 AI Summary
Evaluating large language model (LLM) agents' capabilities in long-horizon, asynchronous, multi-task collaborative planning remains an open challenge. Method: We introduce the first asynchronous planning benchmark for LLM agents, grounded in a scalable simulation environment. It systematically models real-world complexities—including temporal overlap, task interruption, and dynamic resumption—and comprises paired synchronous and asynchronous task datasets. We employ prompting frameworks (e.g., ReAct) for behavioral analysis and failure attribution. Contribution/Results: We formally define and evaluate three core capabilities—*asynchronous temporal modeling*, *cross-task dependency reasoning*, and *dynamic self-auditing*. Experiments reveal a stark performance gap: GPT-4o with ReAct achieves 47% accuracy on synchronous tasks but only 11% on asynchronous ones, exposing fundamental limitations in temporal reasoning and long-term feedback integration. This benchmark establishes a critical evaluation baseline and identifies concrete directions for advancing LLM agent planning architectures.
📝 Abstract
Effective asynchronous planning, i.e., the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (GPT-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.