Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX

📅 2026-04-16

🤖 AI Summary

Existing safety evaluation benchmarks struggle to capture trajectory-level risks in emerging execution environments such as OpenClaw and Codex. To address this gap, this work proposes a scalable, domain-specific methodology for constructing safety trajectory benchmarks on top of the ATBench framework. The approach customizes a three-dimensional safety taxonomy (risk sources, failure modes, and real-world harms) for each setting and combines the shared ATBench generation pipeline with domain-specific risk modeling to build trajectory datasets centered on toolchains, sessions, code repositories, and dependencies. The resulting benchmarks are ATBench-Claw, which covers sensitive execution chains in OpenClaw, and ATBench-CodeX, which covers trajectories in the OpenAI Codex / Codex-runtime setting. Both are intended to support the diagnosis of safety issues in real-world deployment scenarios.

📝 Abstract

As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-CodeX, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the three-dimensional Safety Taxonomy over risk source, failure mode, and real-world harm, and then use that customized taxonomy to define the benchmark specification consumed by the shared ATBench construction pipeline. This extensibility matters because agent frameworks remain relatively stable at the architectural level even as their concrete execution settings, tool ecosystems, and product capabilities evolve quickly. Concretely, ATBench-Claw targets OpenClaw-sensitive execution chains over tools, skills, sessions, and external actions, while ATBench-CodeX targets trajectories in the OpenAI Codex / Codex-runtime setting over repositories, shells, patches, dependencies, approvals, and runtime policy boundaries. Our emphasis therefore falls on taxonomy customization, domain-specific risk coverage, and benchmark design under a shared ATBench generation framework.
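The adaptation mechanism described in the abstract (customize the three-axis taxonomy, then derive a benchmark specification for the shared pipeline) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names `SafetyTaxonomy` and `BenchmarkSpec`, the helper `extend`, and every category label are hypothetical and do not come from the report. Only the three axis names and the listed execution surfaces are taken from the abstract.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SafetyTaxonomy:
    """Three axes named in the abstract; the labels stored here are placeholders."""
    risk_sources: tuple[str, ...]
    failure_modes: tuple[str, ...]
    real_world_harms: tuple[str, ...]

    def cells(self):
        """Enumerate every (risk source, failure mode, harm) combination."""
        return list(product(self.risk_sources, self.failure_modes, self.real_world_harms))


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical specification handed to a shared generation pipeline."""
    name: str
    setting: str
    surfaces: tuple[str, ...]  # execution elements the trajectories range over
    taxonomy: SafetyTaxonomy


def extend(base, risk_sources=(), failure_modes=(), real_world_harms=()):
    """Return a new taxonomy with setting-specific categories appended to each axis."""
    return SafetyTaxonomy(
        base.risk_sources + tuple(risk_sources),
        base.failure_modes + tuple(failure_modes),
        base.real_world_harms + tuple(real_world_harms),
    )


# Assumed base categories, invented for illustration only.
BASE = SafetyTaxonomy(
    risk_sources=("user_instruction", "tool_output"),
    failure_modes=("unsafe_action", "missed_refusal"),
    real_world_harms=("data_loss", "privacy_leak"),
)

# Surfaces are the ones the abstract lists; the added categories are assumptions.
claw_spec = BenchmarkSpec(
    name="ATBench-Claw",
    setting="OpenClaw",
    surfaces=("tools", "skills", "sessions", "external actions"),
    taxonomy=extend(BASE, risk_sources=("third_party_skill",)),
)

codex_spec = BenchmarkSpec(
    name="ATBench-CodeX",
    setting="OpenAI Codex / Codex-runtime",
    surfaces=("repositories", "shells", "patches", "dependencies",
              "approvals", "runtime policy boundaries"),
    taxonomy=extend(BASE, risk_sources=("repository_content",),
                    failure_modes=("approval_bypass",)),
)

if __name__ == "__main__":
    for spec in (claw_spec, codex_spec):
        print(spec.name, "taxonomy cells:", len(spec.taxonomy.cells()))
```

Keeping the taxonomy immutable and deriving each domain variant from a shared base mirrors the report's framing that the pipeline stays fixed while the specification changes per setting. How the actual ATBench pipeline represents its specification is not stated in this summary.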

Problem

Research questions and friction points this paper is trying to address.

trajectory safety

benchmark

agent systems

safety evaluation

diagnosis

Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory safety

safety taxonomy

benchmark extensibility