🤖 AI Summary
Existing trajectory-level benchmarks for evaluating the safety of large language model (LLM) agents suffer from limited interaction diversity, coarse-grained observability of safety failures, and insufficient realism over long horizons. To address these limitations, this work proposes ATBench, a structured evaluation framework that organizes agentic risk along three dimensions: risk sources, failure modes, and real-world harm. The framework pairs heterogeneous tool pools with a delayed-trigger mechanism to simulate multi-stage risk evolution. The benchmark comprises 1,000 high-quality trajectories averaging 9.01 turns and 3.95k tokens, constructed through rule-based filtering, LLM-assisted selection, and human review, and encompassing 1,954 tool invocations. Experimental results show that the benchmark poses significant challenges to leading closed-source and open-source models as well as to existing safety mitigation systems, effectively uncovering fine-grained and long-horizon safety failure patterns.
📝 Abstract
Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than from isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering and a full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.