Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators

📅 2025-10-05
🤖 AI Summary
Evaluating robot policies in real-world environments is costly and unreliable. Method: the paper proposes a prediction-powered inference framework that jointly leverages simulation and real-world testing. It formalizes simulation-to-reality evaluation as a statistical bias-correction problem: simulation bias is learned from paired sim/real data, and confidence intervals on policy performance are constructed with a non-asymptotic mean estimation algorithm. The approach combines physics-based simulation, diffusion-based policies, and multi-task fine-tuning of a pretrained π₀ model to enable a trustworthy mapping from simulated outcomes to real-world performance. Results: under equivalent performance guarantees, the method reduces real-world testing effort by 20–25% compared to pure hardware evaluation, significantly improving assessment efficiency and scalability. The core contribution is a statistically principled framework for joint simulation-reality evaluation, accompanied by finite-sample theoretical guarantees and a practical algorithm.

📝 Abstract
Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both a diffusion policy and a multi-task fine-tuned π₀ policy on a joint distribution of objects and initial conditions, and find that our approach saves 20–25% of hardware evaluation effort while achieving similar bounds on policy performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating robot policies with limited real-world trials
Combining simulation and real data for reliable assessment
Reducing hardware testing needs while ensuring performance accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining real and simulation evaluations via prediction-powered inference
Using non-asymptotic mean estimation for confidence intervals
Rectifying simulation bias with limited paired real-world data
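The core idea above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the paper's algorithm: the point estimate is the large-scale simulation mean plus a "rectifier" (the mean real-minus-sim gap on paired trials), and the finite-sample interval uses a Hoeffding bound on each term with the failure probability split evenly between them. The function name, the specific concentration inequality, and the bounded-outcome assumption (scores in [lo, hi], e.g. success in {0, 1}) are illustrative choices.

```python
import numpy as np


def ppi_mean_ci(real_paired, sim_paired, sim_large, alpha=0.05, lo=0.0, hi=1.0):
    """Prediction-powered estimate of mean real-world policy performance.

    real_paired / sim_paired: outcomes from n paired real/sim evaluations.
    sim_large: outcomes from N simulation-only evaluations (N >> n).
    Returns a point estimate and a (1 - alpha) confidence interval,
    assuming all outcomes lie in [lo, hi].
    """
    real_paired = np.asarray(real_paired, dtype=float)
    sim_paired = np.asarray(sim_paired, dtype=float)
    sim_large = np.asarray(sim_large, dtype=float)
    n, N = len(real_paired), len(sim_large)
    span = hi - lo

    # Cheap but biased simulation estimate, plus a bias correction
    # ("rectifier") learned from the small paired sample.
    sim_mean = sim_large.mean()
    rectifier = (real_paired - sim_paired).mean()
    point = sim_mean + rectifier

    # Hoeffding half-widths: sim outcomes have range `span`; the paired
    # differences have range 2 * span. Union bound at level alpha/2 each.
    w_sim = span * np.sqrt(np.log(4.0 / alpha) / (2.0 * N))
    w_rect = 2.0 * span * np.sqrt(np.log(4.0 / alpha) / (2.0 * n))
    half = w_sim + w_rect
    return point, (point - half, point + half)
```

Because the simulation-only term shrinks with N while only the rectifier term depends on the expensive real-world sample size n, the interval can match a pure-hardware Hoeffding interval with fewer real trials, which is the source of the reported 20–25% savings.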