Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In imitation learning, policy comparison is constrained to small sample sizes (e.g., n = 10–50), and existing batch tests fix the number of trials in advance, which makes comparisons rigid and inefficient. Method: This paper introduces an adaptive sequential framework for comparing two policies, aimed at robotic manipulation tasks. It integrates sequential hypothesis testing, confidence sequence analysis, and optimal stopping theory to balance statistical rigor (controlling both Type I and Type II errors) against sampling efficiency. Contribution/Results: The test lets the evaluator decide from intermediate results whether to run more trials, reducing evaluation trial counts by up to 40% relative to state-of-the-art baselines while preserving probabilistic correctness and statistical power. The savings are largest in the hardest-to-distinguish comparisons; in a multi-task comparison scenario, it saves more than 200 simulation rollouts.

📝 Abstract

Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new methods are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials. However, policy comparison is fundamentally constrained by a small feasible sample size (e.g., 10 or 50) due to significant human effort and limited inference throughput of policies. This paper proposes a novel statistical framework for rigorously comparing two policies in the small sample size regime. Prior work in statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and lacks flexibility in adapting the sample size to the observed evaluation data. Furthermore, extending the test with additional trials risks inducing inadvertent p-hacking, undermining statistical assurances. In contrast, our proposed statistical test is sequential, allowing researchers to decide whether or not to run more trials based on intermediate results. This adaptively tailors the number of trials to the difficulty of the underlying comparison, saving significant time and effort without sacrificing probabilistic correctness. Extensive numerical simulation and real-world robot manipulation experiments show that our test achieves near-optimal stopping, letting researchers stop evaluation and make a decision in a near-minimal number of trials. Specifically, it reduces the number of evaluation trials by up to 40% as compared to state-of-the-art baselines, while preserving the probabilistic correctness and statistical power of the comparison. Moreover, our method is strongest in the most challenging comparison instances (requiring the most evaluation trials); in a multi-task comparison scenario, we save the evaluator more than 200 simulation rollouts.
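To make the contrast between batch and sequential testing concrete, here is a minimal sketch of a textbook Wald sequential probability ratio test (SPRT) applied to paired success/failure rollouts. It is not the test proposed in the paper, which the abstract does not specify; it only illustrates how a sequential test can stop early when the evidence is clear. The function name `sprt_paired`, the alternative `p1`, and the error levels `alpha` and `beta` are assumptions chosen for illustration.

```python
import math


def sprt_paired(outcomes_a, outcomes_b, p1=0.75, alpha=0.05, beta=0.2):
    """Textbook Wald SPRT on paired binary rollouts (illustrative only).

    Only discordant pairs (one policy succeeds, the other fails) are used.
    H0: a discordant pair favors policy B with probability 0.5.
    H1: a discordant pair favors policy B with probability p1 (> 0.5).
    Returns (decision, number_of_paired_trials_consumed).
    """
    upper = math.log((1 - beta) / alpha)   # crossing it rejects H0
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, (a, b) in enumerate(zip(outcomes_a, outcomes_b), start=1):
        if a == b:
            continue  # concordant pairs do not change the evidence
        if b > a:
            llr += math.log(p1 / 0.5)          # pair favors B
        else:
            llr += math.log((1 - p1) / 0.5)    # pair favors A
        if llr >= upper:
            return "B better", n
        if llr <= lower:
            return "no evidence B better", n
    return "undecided", n


# Hypothetical data: B succeeds on every trial, A on none.
decision, trials = sprt_paired([0] * 20, [1] * 20)
print(decision, trials)
```

With a clear-cut gap between the policies, the loop stops well before the 20 available trials; when outcomes mostly agree, it keeps sampling. This is the adaptive behavior the abstract describes, although the paper's own test and its near-optimality guarantees are different from this simple-versus-simple SPRT.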

Problem

Research questions and friction points this paper is trying to address.

Develops a statistical framework for comparing imitation learning policies with small sample sizes.

Proposes a sequential testing method to adaptively determine the number of evaluation trials.

Reduces evaluation trials by up to 40% while maintaining statistical correctness and power.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential statistical test for policy comparison

Adaptive trial size based on intermediate results

Near-optimal stopping reduces evaluation trials

🔎 Similar Papers

RILe: Reinforced Imitation Learning

2024-06-12 arXiv.org Citations: 0