🤖 AI Summary
This work addresses a limitation of conventional robot evaluation: reliance on binary success rates, which fail to capture fine-grained aspects of task execution such as progress, efficiency, and stability. The authors propose PRM-as-a-Judge, a dense evaluation paradigm in which a Process Reward Model (PRM) audits policy execution directly from trajectory videos, together with the OPD (Outcome-Process-Diagnosis) metric system, which formalizes execution quality via a task-aligned progress potential. Dense evaluation is characterized by two axiomatic properties: macro-consistency (additive, path-consistent aggregation) and micro-resolution (sensitivity to fine-grained physical evolution). On RoboPulse, a diagnostic benchmark for micro-scale progress discrimination, several trajectory-trained PRM judges outperform similarity-based methods and general-purpose foundation-model judges. A structured audit of mainstream policy paradigms on long-horizon tasks then reveals behavioral signatures and failure modes that outcome-only metrics miss.
📝 Abstract
Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
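The claim that macro-consistency follows from a scalar potential can be illustrated with a small sketch. It assumes, purely for illustration, that a PRM judge emits one progress estimate per frame and that per-step progress is the difference of consecutive estimates. The function names (`step_gains`, `segment_progress`) and the numeric values are hypothetical and are not taken from the paper. Under that assumption the step gains telescope, so aggregated progress over a segment depends only on its endpoint potentials (path consistency) and segment totals add up (additivity).

```python
from typing import List, Sequence


def step_gains(potentials: Sequence[float]) -> List[float]:
    """Per-step progress: difference of consecutive potential values."""
    return [nxt - cur for cur, nxt in zip(potentials, potentials[1:])]


def segment_progress(potentials: Sequence[float], start: int, end: int) -> float:
    """Sum of step gains over frames start..end (inclusive).

    Because the gains telescope, this equals
    potentials[end] - potentials[start] up to floating-point rounding.
    """
    return sum(step_gains(potentials[start:end + 1]))


# Hypothetical per-frame progress estimates in [0, 1] for two executions
# of the same task; these numbers are made up for illustration.
direct = [0.0, 0.3, 0.6, 1.0]            # steady progress
detour = [0.0, 0.4, 0.2, 0.1, 0.5, 1.0]  # regresses, then recovers

total_direct = segment_progress(direct, 0, len(direct) - 1)
total_detour = segment_progress(detour, 0, len(detour) - 1)

# Path consistency: both runs share start and end potentials, so their
# aggregated progress agrees even though the paths differ.
same_total = abs(total_direct - total_detour) < 1e-9

# Additivity: progress over two adjacent segments sums to the whole.
split_sum = segment_progress(detour, 0, 2) + segment_progress(detour, 2, 5)
additive = abs(split_sum - total_detour) < 1e-9
```

The two runs are indistinguishable by total progress, yet the individual step gains of `detour` contain negative entries. That per-step signal is the kind of process information a binary success rate discards.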