Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently and reliably evaluating real-world robotic policies, which is hindered by high hardware costs and scarce samples. To overcome the limitations of conventional approaches that only support binary success metrics, we propose a sequential testing framework based on Safe Anytime-Valid Inference (SAVI) that unifies the evaluation of diverse task metrics—including discrete partial scores and continuous rewards—enabling fine-grained, sample-efficient policy comparisons. By integrating sequential hypothesis testing with both nonparametric and parametric statistics, our method allows early termination once a pre-specified confidence level is reached. Experiments demonstrate that our approach reduces evaluation overhead by up to 70% compared to standard batch methods and by up to 50% relative to existing sequential methods designed for binary outcomes, all while maintaining statistical rigor.

📝 Abstract
Generalist robot manipulation policies are becoming increasingly capable, but their evaluation is limited to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable, efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial-credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.
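The early-stopping behavior described above can be illustrated with a minimal test-by-betting e-process, one common construction behind anytime-valid inference. This is a sketch under stated assumptions, not the paper's actual procedure: the function name `savi_compare`, the fixed bet size `lam`, and the simulated paired score differences are all hypothetical. For bounded differences d in [-1, 1] with E[d] <= 0 under the null, the wealth process is a nonnegative supermartingale, so by Ville's inequality the evaluator may stop the moment wealth crosses 1/alpha while keeping the type-I error at most alpha.

```python
import random

def savi_compare(diffs, alpha=0.05, lam=0.2):
    """Anytime-valid test of H0: E[diff] <= 0 for paired differences in [-1, 1].

    The wealth W_t = prod_i (1 + lam * d_i) with lam in [0, 1) is a
    nonnegative supermartingale under H0, so by Ville's inequality
    P(sup_t W_t >= 1/alpha) <= alpha. We may therefore monitor W_t after
    every episode and stop as soon as it crosses 1/alpha.
    """
    wealth = 1.0
    for t, d in enumerate(diffs, start=1):
        wealth *= 1.0 + lam * d
        if wealth >= 1.0 / alpha:
            return t, wealth  # early stop: policy A declared better than B
    return None, wealth  # evidence insufficient at level alpha

# Hypothetical evaluation data: paired per-episode score differences,
# simulated so that policy A is better by ~0.3 on average.
random.seed(0)
diffs = [min(1.0, max(-1.0, random.gauss(0.3, 0.4))) for _ in range(500)]
stop, wealth = savi_compare(diffs)
```

Because the guarantee holds at every stopping time, the evaluator need not fix the number of rollouts in advance; on clearly separated policies the test typically terminates after a few dozen episodes rather than a full batch, which is the source of the sample savings the paper reports.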
Problem

Research questions and friction points this paper is trying to address.

robot policy evaluation
sample efficiency
statistical rigor
performance comparison
fine-grained metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

sample-efficient evaluation
statistically rigorous comparison
sequential testing
robot policy benchmarking
anytime-valid inference