🤖 AI Summary
Evaluating robot policies in real-world environments is costly and unreliable. Method: This paper proposes a prediction-empowered inference framework that jointly leverages simulation and real-world testing. It formalizes the simulation-to-reality evaluation task as a statistical bias-correction problem, learning simulation biases from paired real–simulation data and constructing confidence intervals for policy performance using a non-asymptotic mean estimation algorithm. The framework is demonstrated with physics-based simulation, evaluating both a diffusion policy and a multi-task fine-tuned pretrained π₀ model, enabling a trustworthy mapping from simulated outcomes to real-world performance. Results: Experiments demonstrate that, under equivalent performance guarantees, the method reduces real-world testing effort by 20–25% compared to pure hardware evaluation, significantly improving assessment efficiency and scalability. The core contribution is a statistically principled framework for joint simulation-reality evaluation, accompanied by finite-sample theoretical guarantees and a practical algorithm.
📝 Abstract
Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge: in practice, they are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework that augments large-scale simulation with relatively small-scale real-world testing to provide reliable inferences about the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and a multi-task fine-tuned π₀ policy on a joint distribution of objects and initial conditions, and find that our approach saves 20–25% of hardware evaluation effort while achieving similar bounds on policy performance.
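The core recipe, combining a large batch of cheap simulated evaluations with a small paired real/sim set that rectifies simulation bias, can be sketched as follows. This is a minimal illustration of the prediction-powered inference idea with Hoeffding-style non-asymptotic intervals, not the paper's actual algorithm; the function name `ppi_mean_ci`, the synthetic success rates, and the specific bound are all assumptions for illustration.

```python
import math
import random
import statistics

def ppi_mean_ci(sim_large, real_paired, sim_paired, alpha=0.05):
    """Prediction-powered estimate of mean real-world success.

    sim_large:   outcomes from many cheap simulated rollouts.
    real_paired: real outcomes on n paired (object, initial condition) draws.
    sim_paired:  simulated outcomes on the same n paired draws.
    Assumes outcomes in [0, 1]; returns (point_estimate, ci_half_width)
    via two Hoeffding bounds combined with a union bound.
    """
    n, N = len(real_paired), len(sim_large)
    # Rectifier: average sim-to-real bias measured on the paired set.
    rectifier = statistics.mean(r - s for r, s in zip(real_paired, sim_paired))
    estimate = statistics.mean(sim_large) + rectifier
    # Split the failure probability alpha across the two estimates;
    # the rectifier terms lie in [-1, 1] (range 2), sim terms in [0, 1].
    half_width = (math.sqrt(math.log(4 / alpha) / (2 * N))        # sim mean
                  + 2 * math.sqrt(math.log(4 / alpha) / (2 * n))) # bias term
    return estimate, half_width

# Hypothetical demo: simulation is optimistically biased
# (80% sim success vs. 60% real success).
random.seed(0)
sim_large = [1 if random.random() < 0.8 else 0 for _ in range(5000)]
pairs = [(1 if random.random() < 0.6 else 0,
          1 if random.random() < 0.8 else 0) for _ in range(200)]
real_paired = [r for r, _ in pairs]
sim_paired = [s for _, s in pairs]
est, half = ppi_mean_ci(sim_large, real_paired, sim_paired)
```

Because the rectifier is estimated from only n paired trials while the simulation mean uses N >> n rollouts, the interval width is dominated by the paired-sample term; the payoff is that far fewer real trials are needed than if the real mean were estimated from hardware alone.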