AI Summary
Existing benchmarks lack the capacity to evaluate large language models' adaptive replanning capabilities under spatiotemporal dynamic perturbations. This work introduces STT-Arena, a simulated environment comprising 227 executable interactive tasks that model nine categories of spatiotemporal conflicts and four levels of solvability, employing spatiotemporal triggers to compel agents to replan upon plan failure. We provide the first systematic definition and evaluation of this capability, uncovering three prevalent error patterns. Building on these insights, we propose STT-Agent-4B, which integrates trajectory iterative refinement with online reinforcement learning. Experimental results demonstrate that state-of-the-art models achieve less than 40% accuracy on this benchmark, whereas STT-Agent-4B significantly outperforms existing approaches.
Abstract
Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even state-of-the-art proprietary models, including Claude-4.6-Opus, achieve less than 40% overall accuracy, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B, which outperforms frontier LLMs on STT-Arena.
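To make the trigger-and-replan idea concrete, the following is a minimal toy sketch in Python. It is not the STT-Arena implementation: the `ToyEnv` class, the room-booking scenario, the `run_agent` function, and the log strings are all hypothetical names invented for illustration. The sketch only shows the general pattern described above: an injected trigger changes the world between planning and acting, the agent re-observes rather than acting on a stale snapshot, replans if needed, and verifies the outcome afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToyEnv:
    """Hypothetical environment: rooms stay open until a trigger closes one."""
    open_rooms: Set[str] = field(default_factory=lambda: {"A", "B"})
    trigger_step: int = 0        # clock value at which the injected trigger fires
    trigger_room: str = "A"      # room that the trigger closes
    clock: int = 0
    booked: Optional[str] = None

    def tick(self) -> None:
        """Advance time by one step, firing the trigger if its step is reached."""
        if self.clock == self.trigger_step:
            self.open_rooms.discard(self.trigger_room)
        self.clock += 1

    def observe(self) -> Set[str]:
        """Return a snapshot of the rooms that are currently open."""
        return set(self.open_rooms)

    def book(self, room: str) -> bool:
        """Book a room; succeeds only if the room is open right now."""
        if room in self.open_rooms:
            self.booked = room
            return True
        return False


def run_agent(env: ToyEnv, preferred: List[str]) -> List[str]:
    """Plan, let time pass, re-observe, replan if invalidated, then verify."""
    log: List[str] = []
    snapshot = env.observe()
    target = next((r for r in preferred if r in snapshot), None)
    if target is None:
        return ["unsolvable"]
    log.append(f"plan:{target}")

    env.tick()  # time passes between planning and acting; the trigger may fire

    # Re-observe instead of acting on the stale snapshot.
    current = env.observe()
    if target not in current:
        candidates = [r for r in preferred if r in current]
        if not candidates:
            log.append("unsolvable")
            return log
        target = candidates[0]
        log.append(f"replan:{target}")

    # Check the final state after adapting rather than assuming success.
    ok = env.book(target)
    log.append(f"verified:{target}" if ok and env.booked == target else "failed")
    return log


if __name__ == "__main__":
    print(run_agent(ToyEnv(), ["A", "B"]))
```

In this toy setting, skipping the re-observation step would correspond loosely to Stale-State Execution, and skipping the final check to Missing Post-Adaptation Verification; the mapping is illustrative only and does not reproduce the benchmark's actual task format or scoring.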