AI Summary
This work addresses a limitation of current evaluations of large language model-based coding agents: they focus mainly on final outcomes and therefore cannot diagnose procedural flaws in multi-step execution. The authors propose ProcBench, a framework that combines a process-oriented defect ontology with a risk-calibrated scoring mechanism. By standardizing heterogeneous execution logs into a unified trajectory representation and modeling failures through the ontology, ProcBench assesses agents' procedural errors and control preservation throughout task execution. The framework is instantiated on 200 human-annotated trajectories and applied across AndroidBench, TerminalBench, and SWE-bench-Verified; the reported results suggest that calibration makes defect findings more interpretable than direct thresholding. Rather than relying only on task completion rates, it offers diagnostic, process-level insight into agent behavior.
Abstract
Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.
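To make the pipeline described above more concrete, the following is a minimal sketch of how heterogeneous logs might be mapped onto a unified trajectory representation, tagged against a small defect ontology, and summarized as a scorecard. Everything in it is an assumption for illustration: the `Step` schema, the defect labels `repeated_action` and `unverified_finish`, their weights, and the detection rules are hypothetical and are not taken from ProcBench. The calibration step the abstract mentions is not modeled here; the sketch only produces raw counts and a weighted total.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical defect labels and risk weights (not ProcBench's actual ontology).
DEFECT_WEIGHTS = {"repeated_action": 1.0, "unverified_finish": 2.0}


@dataclass
class Step:
    """One step of a trajectory in the assumed unified representation."""
    index: int
    action: str
    observation: str
    defects: list = field(default_factory=list)


def normalize(raw_log, action_key, output_key):
    """Map a benchmark-specific log (a list of dicts) onto the shared Step schema."""
    return [
        Step(i, str(entry[action_key]), str(entry.get(output_key, "")))
        for i, entry in enumerate(raw_log)
    ]


def tag_defects(steps):
    """Attach hypothetical process-defect labels using two toy rules."""
    tested = False
    for i, step in enumerate(steps):
        # Rule 1: the agent repeats the previous action verbatim.
        if i > 0 and step.action == steps[i - 1].action:
            step.defects.append("repeated_action")
        if step.action.startswith("test"):
            tested = True
        # Rule 2: the agent submits without any earlier test step.
        if step.action == "submit" and not tested:
            step.defects.append("unverified_finish")
    return steps


def scorecard(steps):
    """Summarize defects as per-label counts plus an uncalibrated weighted total."""
    counts = Counter(d for s in steps for d in s.defects)
    total = sum(DEFECT_WEIGHTS[label] * n for label, n in counts.items())
    return {"counts": dict(counts), "weighted_total": total}
```

Because `normalize` takes the field names as parameters, logs from different benchmarks can share the same tagging and scoring code once they are converted; for example, a terminal-style log keyed by `cmd` and `out` would be passed as `normalize(raw, "cmd", "out")`.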