ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

📅 2026-05-18

🤖 AI Summary
This work addresses a limitation of current evaluations of large language model-based coding agents: they focus predominantly on final outcomes and therefore cannot diagnose procedural flaws in multi-step execution. The authors propose ProcBench, a framework that pairs a process-oriented defect ontology with a risk-calibrated scoring mechanism. By standardizing heterogeneous execution logs into a unified trajectory representation and modeling failures with the ontology, ProcBench assesses agents' procedural errors and control preservation throughout task execution. The framework is instantiated on 200 annotated trajectories and applied across AndroidBench, TerminalBench, and SWE-bench-Verified; the results suggest that calibration improves the interpretability of defect findings relative to direct thresholding. Rather than relying on task completion rates alone, ProcBench offers diagnostic, process-level insight into agent behavior.
๐Ÿ“ Abstract
Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.
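To make the idea of a unified trajectory representation concrete, here is a minimal Python sketch. It is not ProcBench's actual schema or API: the `Step` record, the two adapter functions, the input log shapes, and the defect label `unverified_edit` are all hypothetical, chosen only to show how heterogeneous logs could be mapped onto one step format and then summarized per defect class.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One normalized trajectory step (hypothetical schema, not from the paper)."""
    index: int
    action: str                      # e.g. "shell", "edit", "test"
    detail: str                      # raw command or tool arguments
    defects: list[str] = field(default_factory=list)  # annotated defect labels


def from_shell_log(entries: list[dict[str, Any]]) -> list[Step]:
    """Assumed input shape: [{"cmd": "pytest -q", "exit_code": 1}, ...]."""
    return [Step(i, "shell", e["cmd"]) for i, e in enumerate(entries)]


def from_tool_calls(calls: list[dict[str, Any]]) -> list[Step]:
    """Assumed input shape: [{"tool": "edit", "args": "src/app.py"}, ...]."""
    return [Step(i, c["tool"], str(c["args"])) for i, c in enumerate(calls)]


def defect_counts(trajectory: list[Step]) -> dict[str, int]:
    """Count how often each annotated defect label occurs in a trajectory."""
    counts: dict[str, int] = {}
    for step in trajectory:
        for label in step.defects:
            counts[label] = counts.get(label, 0) + 1
    return counts


# Usage: two different log formats end up in the same representation.
shell_traj = from_shell_log([{"cmd": "git status", "exit_code": 0}])
tool_traj = from_tool_calls([
    {"tool": "edit", "args": "src/app.py"},
    {"tool": "test", "args": "tests/"},
])
tool_traj[0].defects.append("unverified_edit")  # hypothetical annotation
summary = defect_counts(tool_traj)
```

Once every log source is reduced to the same step record, annotation and scoring code only needs to be written once, which is the practical point of standardizing before evaluating.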
Problem

Research questions and friction points this paper is trying to address.

process-level defects
control preservation
LLM coding agents
execution trajectories
benchmark evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

process-level evaluation
coding agent trajectory
defect ontology
control preservation
risk-calibrated scorecard