OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an overlooked issue: task success often masks underlying process anomalies, such as unresolved ambiguities, unsafe external writes, and ignored errors, so outcome-based evaluation alone is insufficient for assessing runtime reliability. The study frames this mismatch as the "Outcome-Process Gap" and presents OpenClawBench, a large-scale structured dataset of 31,264 annotated execution trajectories from BFCL-driven OpenClaw sessions produced by six source models. Using the FullTax annotation framework, the dataset provides fine-grained labels: binary anomaly indicators, supporting evidence, onset/span localization, severity, recoverability, and a five-class anomaly taxonomy. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous. A LoRA-fine-tuned Gemma 3 12B detector, trained on the high-confidence FullTax supervised pool, reaches a binary F1 of 0.729 on the held-out test split.

📝 Abstract

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
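As a rough illustration of how the Outcome-Process Gap could be measured from labeled trajectories, the sketch below computes the share of oracle-passing executions that are still flagged as process-anomalous. The record type and field names (`TrajectoryLabel`, `oracle_passed`, `is_anomalous`) are hypothetical and not taken from the dataset's actual schema, and treating the gap as a simple fraction is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrajectoryLabel:
    """Hypothetical record; field names are illustrative, not the real schema."""
    trajectory_id: str
    oracle_passed: bool   # did the final task oracle accept the outcome?
    is_anomalous: bool    # binary process-anomaly label


def gap_from_counts(anomalous: int, oracle_passing: int) -> float:
    """Fraction of oracle-passing executions labeled process-anomalous."""
    if oracle_passing <= 0:
        raise ValueError("oracle_passing must be positive")
    return anomalous / oracle_passing


def outcome_process_gap(labels: Iterable[TrajectoryLabel]) -> float:
    """Compute the same fraction directly from per-trajectory labels."""
    passed = [r for r in labels if r.oracle_passed]
    anomalous = sum(r.is_anomalous for r in passed)
    return gap_from_counts(anomalous, len(passed))


# Counts reported in the abstract: 2,904 anomalous among 31,135 oracle-passing.
reported_gap = gap_from_counts(2904, 31135)
print(f"{reported_gap:.1%}")
```

With the counts reported in the abstract, this works out to about 9.3% of oracle-passing executions carrying a process-side anomaly label.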

Problem

Research questions and friction points this paper is trying to address.

Outcome-Process Gap

process anomalies

agent execution

task success evaluation

runtime reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Outcome-Process Gap

process-side anomalies

agent execution trajectories