PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

📅 2026-03-23
🤖 AI Summary
This work addresses a limitation of conventional robot evaluation: binary success rates dominate, and they fail to capture fine-grained aspects of task execution such as progress, efficiency, and stability. The authors propose PRM-as-a-Judge, a dense evaluation paradigm in which a Process Reward Model (PRM) audits policy execution directly from trajectory videos by estimating task progress. Its core is the OPD (Outcome-Process-Diagnosis) metric system, which formalizes execution quality through a task-aligned progress potential. Dense evaluation is characterized by two axiomatic properties: macro-consistency (additive, path-consistent aggregation) and micro-resolution (sensitivity to fine-grained physical evolution). On the RoboPulse diagnostic benchmark, several trajectory-trained PRM judges outperform similarity-based methods and general-purpose foundation-model judges, and a structured audit of mainstream policy paradigms on long-horizon tasks reveals failure modes that outcome-only metrics miss.

📝 Abstract
Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
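
The macro-consistency claim in the abstract can be illustrated with a small sketch. It assumes a judge that emits one scalar progress potential per frame; the values in `phi` and the helper names `step_gains` and `segment_gain` are hypothetical and are not taken from the paper, whose actual OPD metric definitions are not given here. The point is only that when per-step scores are differences of a scalar potential, their sum telescopes, so the aggregate is additive across segments and depends only on the endpoint potentials.

```python
from typing import List, Sequence


def step_gains(potential: Sequence[float]) -> List[float]:
    """Per-step progress gain: difference of consecutive potential values."""
    return [b - a for a, b in zip(potential, potential[1:])]


def segment_gain(potential: Sequence[float], start: int, end: int) -> float:
    """Aggregate gain over frames start..end (inclusive) by summing step gains."""
    return sum(step_gains(potential[start:end + 1]))


# Hypothetical per-frame progress estimates in [0, 1] (illustrative only).
# The dip at index 3 stands for a regression, e.g. a slipped grasp.
phi = [0.0, 0.25, 0.5, 0.375, 0.75, 1.0]

total = segment_gain(phi, 0, 5)                            # whole trajectory
split = segment_gain(phi, 0, 2) + segment_gain(phi, 2, 5)  # two sub-segments

print(total, split, phi[-1] - phi[0])
print(step_gains(phi))
```

Under these assumptions, splitting the trajectory at any frame gives the same total as scoring it whole, while the negative entry in `step_gains(phi)` still exposes the local regression that a binary success flag would hide.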
Problem

Research questions and friction points this paper is trying to address.

robotic evaluation
dense evaluation
task progress
execution quality
fine-grained auditing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model
dense evaluation
OPD metric
macro-consistency
micro-resolution
Yuheng Ji
Institute of Automation, Chinese Academy of Sciences
Embodied AI, Computer Vision
Yuyang Liu
IIIS (Institute for Interdisciplinary Information Sciences), Tsinghua University
Computer Vision, Unsupervised Learning, Robotics, Reinforcement Learning
Huajie Tan
Peking University
Embodied AI, Foundation Models
Xuchuan Huang
Peking University
Robot Learning, Dexterous Manipulation
Fanding Huang
Tsinghua University
Semantic Segmentation, Test-time Adaptation, Large Language Models
Yijie Xu
Hong Kong University of Science and Technology (Guangzhou)
Data Mining, Natural Language Processing, Large Language Models
Cheng Chi
Columbia University, Stanford University
robotics
Yuting Zhao
Institute of Automation, Chinese Academy of Sciences
Computer Vision
Huaihai Lyu
Institute of Automation
multi-modal, embodied intelligence
Peterson Co
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing Academy of Artificial Intelligence
Mingyu Cao
Beijing Academy of Artificial Intelligence
Qiongyu Zhang
Beijing Academy of Artificial Intelligence, University of Sydney
Zhe Li
Beijing Academy of Artificial Intelligence
Enshen Zhou
Beihang University
Embodied AI, Embodied Agent, Robot Learning, Generative Model
Pengwei Wang
University of Calgary
Computer Science Security
Zhongyuan Wang
BAAI
Knowledge Mining, Database, NLP, Text Understanding
Shanghang Zhang
Peking University
Embodied AI, Foundation Models
Xiaolong Zheng
Institute of Automation, Chinese Academy of Sciences & School of Artificial Intelligence, UCAS
Big Data Analytics, Social Computing, General Artificial Intelligence, Parallel Intelligence