🤖 AI Summary
Current evaluations of AI agents are largely confined to context-free, single-turn interactions, which fail to capture their performance on complex real-world tasks. This work proposes ASTRA-bench, the first dynamic benchmark to integrate temporally evolving personal context, interactive tool use, and multi-step reasoning. The benchmark comprises 2,413 scenarios generated by an event-driven pipeline across four protagonists, each annotated along three complexity dimensions (referential, functional, and informational), and ships with an executable environment and automated evaluation scripts. Experiments reveal a significant performance drop among leading large language models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) on high-complexity tasks, exposing fundamental limitations in contextual awareness and reliable planning, with argument generation identified as the critical bottleneck.
📝 Abstract
Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user intents. Our event-driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state-of-the-art models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) reveals significant performance degradation under high-complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents' ability to ground reasoning within messy personal context and orchestrate reliable multi-step plans. We release ASTRA-bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context-aware AI assistants.
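To make the setup concrete, below is a minimal sketch of what a complexity-annotated scenario record and an automated tool-call scorer could look like. The `Scenario` fields and the `score_tool_call` helper are hypothetical illustrations, not the actual ASTRA-bench schema or evaluation scripts; they only show how tool selection can be scored separately from argument generation, the bottleneck the paper highlights.

```python
from dataclasses import dataclass

# Hypothetical scenario record; field names are illustrative,
# not the actual ASTRA-bench data schema.
@dataclass
class Scenario:
    protagonist: str            # one of the four personas
    events: list[str]           # longitudinal life events grounding the task
    user_intent: str            # the request the agent must fulfill
    complexity: dict[str, int]  # referential / functional / informational levels
    gold_calls: list[dict]      # expected tool calls with arguments

def score_tool_call(predicted: dict, gold: dict) -> dict:
    """Illustrative scorer: credits tool selection and per-argument
    accuracy separately, so argument-generation errors are visible."""
    tool_ok = predicted.get("tool") == gold["tool"]
    gold_args = gold.get("args", {})
    matched = sum(
        1 for k, v in gold_args.items()
        if predicted.get("args", {}).get(k) == v
    )
    arg_acc = matched / len(gold_args) if gold_args else 1.0
    return {"tool_selected": tool_ok, "argument_accuracy": arg_acc}

# Example: the agent picks the right tool but misgrounds one argument
# drawn from the personal context (a typical high-complexity failure).
gold = {"tool": "calendar.create_event",
        "args": {"title": "Dentist", "date": "2025-03-14"}}
pred = {"tool": "calendar.create_event",
        "args": {"title": "Dentist", "date": "2025-03-15"}}
print(score_tool_call(pred, gold))
# {'tool_selected': True, 'argument_accuracy': 0.5}
```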