AI Summary
Existing urban LLM agent studies predominantly evaluate outcome-oriented metrics (e.g., prediction accuracy), lacking fine-grained diagnostic analysis of spatiotemporal reasoning processes, which hinders understanding of capability boundaries and optimization pathways. To address this, we propose USTBench, the first spatiotemporal reasoning benchmark for urban agents, assessing four dimensions: comprehension, prediction, planning, and reflective feedback, all evaluated at the process level within the interactive simulation environment UAgentEnv. Our method introduces a decoupled four-dimensional evaluation framework, integrating 62,466 structured QA pairs with end-to-end task evaluation. Extensive experiments across 13 state-of-the-art LLMs reveal significant bottlenecks in long-horizon planning and dynamic environmental adaptation; critically, general-purpose reasoning models show no consistent advantage on urban tasks, underscoring the necessity of domain-specific optimization.
Abstract
Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision-making tasks and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-level evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle with long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation for building more adaptive and effective LLM-based urban agents and broad smart city applications.
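The process-level evaluation described above, scoring structured QA pairs separately per reasoning dimension rather than only end-to-end outcomes, can be sketched as follows. This is a minimal illustration assuming simple exact-match scoring; all names (`QAItem`, `per_dimension_accuracy`, the dimension labels) are hypothetical and not drawn from USTBench's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record for one process-level QA item; field names are
# illustrative, not the benchmark's real schema.
@dataclass
class QAItem:
    dimension: str      # e.g., "understanding", "forecasting", "planning", "reflection"
    question: str
    gold_answer: str
    model_answer: str

def per_dimension_accuracy(items):
    """Aggregate exact-match accuracy for each reasoning dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        total[it.dimension] += 1
        if it.model_answer.strip().lower() == it.gold_answer.strip().lower():
            correct[it.dimension] += 1
    return {d: correct[d] / total[d] for d in total}

items = [
    QAItem("understanding", "Which road segment is congested at 8am?", "seg_12", "seg_12"),
    QAItem("planning", "Choose the next signal phase.", "NS_green", "EW_green"),
]
print(per_dimension_accuracy(items))  # {'understanding': 1.0, 'planning': 0.0}
```

Keeping scores broken out per dimension is what enables the diagnostic comparisons the benchmark reports, such as models doing well on understanding QA while failing planning QA.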