ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of LLM-based coding agents focus mainly on final outcomes, so they cannot diagnose procedural flaws in multi-step execution. The authors propose ProcBench, a framework that pairs a process-oriented defect ontology with calibrated, risk-based scorecards. It standardizes heterogeneous execution logs into a unified trajectory representation and uses the ontology to assess agents' process defects and control preservation throughout task execution. The framework is instantiated on 200 human-annotated trajectories and applied across AndroidBench, TerminalBench, and SWE-bench-Verified. The results suggest that calibration makes defect findings more interpretable than direct thresholding, and that process-aware scorecards offer diagnostic distinctions that task completion rates alone do not.

📝 Abstract

Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.
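
To make the idea of a unified trajectory representation and a per-defect scorecard concrete, here is a minimal Python sketch. It is not ProcBench's actual schema or scoring method: the `Step` fields, the raw log keys (`tool`, `type`, `command`, `input`, `defects`), and the defect label `unverified_destructive_action` are all hypothetical. The scorecard reports raw prevalence only and does not attempt the calibration step the abstract describes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One normalized step of an agent trajectory (hypothetical schema)."""
    index: int
    action: str                                        # e.g. "shell", "edit", "test"
    detail: str                                        # command or input text
    defects: List[str] = field(default_factory=list)  # annotated defect labels


def normalize(raw_events: List[dict]) -> List[Step]:
    """Map log records with differing key names onto the common Step shape."""
    steps = []
    for i, ev in enumerate(raw_events):
        # Different harnesses are assumed to name these fields differently.
        action = ev.get("tool") or ev.get("type") or "unknown"
        detail = ev.get("command") or ev.get("input") or ""
        steps.append(Step(i, action, detail, list(ev.get("defects", []))))
    return steps


def scorecard(trajectories: List[List[Step]]) -> Dict[str, float]:
    """Fraction of trajectories in which each defect label occurs at least once."""
    if not trajectories:
        return {}
    counts = Counter()
    for steps in trajectories:
        # Count each label once per trajectory, however often it recurs.
        counts.update({label for step in steps for label in step.defects})
    return {label: counts[label] / len(trajectories) for label in sorted(counts)}


# Two toy trajectories from logs that use different key names.
raw_a = [
    {"tool": "shell", "command": "rm -rf build",
     "defects": ["unverified_destructive_action"]},
    {"type": "test", "input": "pytest"},
]
raw_b = [{"tool": "edit", "command": "patch foo.py"}]

print(scorecard([normalize(raw_a), normalize(raw_b)]))
# prints {'unverified_destructive_action': 0.5}
```

Counting each label once per trajectory keeps a single long run with a repeated defect from dominating the scorecard. A step-level rate would be an equally reasonable choice under different assumptions.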

Problem

Research questions and friction points this paper is trying to address.

process-level defects

control preservation

LLM coding agents

execution trajectories

benchmark evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

process-level evaluation

coding agent trajectory

defect ontology