🤖 AI Summary
This work addresses an overlooked problem: task success can mask process anomalies such as unresolved ambiguity, unsafe external writes, and ignored errors, so outcome-based evaluation alone is insufficient for assessing runtime reliability. The study defines and quantifies this mismatch as the "Outcome-Process Gap" and presents OpenClawBench, a large-scale dataset of 31,264 annotated trajectories from BFCL-driven OpenClaw sessions produced by six source models. Using the FullTax annotation framework, the dataset provides fine-grained labels: binary anomaly indicators, supporting evidence, onset/span localization, severity, recoverability, and a five-class anomaly taxonomy. Among 31,135 oracle-passing trajectories, 2,904 are still labeled process-anomalous. A LoRA-fine-tuned Gemma 3 12B detector, trained on the high-confidence FullTax annotations, reaches a binary F1 of 0.729 on the held-out test split.
📝 Abstract
Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent executions. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories, each aligning the task-oracle outcome with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. With OpenClawBench, the Outcome-Process Gap becomes measurable: among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. This result shows that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the held-out test split with cleaner labels. Overall, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
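To make the measurement concrete, the following is a minimal Python sketch of how a FullTax-style label might be represented and how the Outcome-Process Gap could be computed from it. The abstract does not specify the actual schema, so every name here (`AnomalyType`, `TrajectoryLabel`, `outcome_process_gap`, and all field names) is hypothetical, and mapping the five anomaly kinds listed in the abstract onto the 5-class taxonomy is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class AnomalyType(Enum):
    # Assumption: the five anomaly kinds named in the abstract are the
    # five taxonomy classes; the real class names may differ.
    UNRESOLVED_AMBIGUITY = "unresolved_ambiguity"
    UNSAFE_EXTERNAL_WRITE = "unsafe_external_write"
    IGNORED_ERROR = "ignored_error"
    WEAKLY_GROUNDED_COMMITMENT = "weakly_grounded_commitment"
    CAPABILITY_OVERCOMMITMENT = "capability_boundary_overcommitment"


@dataclass
class TrajectoryLabel:
    """Hypothetical record pairing an oracle outcome with process labels."""
    trajectory_id: str
    oracle_passed: bool                      # task-oracle outcome
    is_anomalous: bool                       # binary process-anomaly label
    anomaly_type: Optional[AnomalyType] = None
    evidence: Optional[str] = None           # supporting evidence text
    span: Optional[Tuple[int, int]] = None   # (onset step, end step)
    severity: Optional[int] = None           # scale is an assumption
    recoverable: Optional[bool] = None


def outcome_process_gap(labels: List[TrajectoryLabel]) -> float:
    """Fraction of oracle-passing trajectories labeled process-anomalous."""
    passing = [t for t in labels if t.oracle_passed]
    if not passing:
        return 0.0
    return sum(t.is_anomalous for t in passing) / len(passing)


# Rate implied by the counts reported in the abstract.
reported_rate = 2904 / 31135
print(f"{reported_rate:.2%}")
```

With the reported counts, 2,904 of 31,135 oracle-passing executions works out to roughly 9.3%. The sketch conditions on oracle-passing trajectories only, since the gap concerns anomalies that a success-only evaluation would not surface.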