🤖 AI Summary
Problem: In imitation learning, policy evaluation is often limited to small sample sizes (e.g., n = 10–50), and existing batch tests fix the number of trials in advance, which makes comparisons rigid and inefficient.
Method: This paper introduces an adaptive sequential policy comparison framework tailored for robotic manipulation tasks. It integrates sequential hypothesis testing, confidence sequence analysis, and optimal stopping theory to jointly optimize statistical rigor (controlling both Type I and Type II errors) and sampling efficiency.
Contribution/Results: The framework stops evaluation as soon as the observed data support a decision, reducing the number of evaluation trials by up to 40% compared to state-of-the-art baselines while preserving statistical power. The savings are largest in the hardest-to-distinguish comparisons; in a multi-task simulation scenario, it saves more than 200 rollouts. To our knowledge, this is the first application of near-optimal sequential testing to small-sample imitation learning policy evaluation, breaking the rigidity of fixed-sample paradigms.
📝 Abstract
Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new methods are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials. However, policy comparison is fundamentally constrained by a small feasible sample size (e.g., 10 or 50) due to the significant human effort required and the limited inference throughput of policies. This paper proposes a novel statistical framework for rigorously comparing two policies in the small sample size regime. Prior work in statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and cannot adapt the sample size to the observed evaluation data. Furthermore, extending a batch test with additional trials risks inadvertent p-hacking, undermining its statistical assurances. In contrast, our proposed statistical test is sequential, allowing researchers to decide whether to run more trials based on intermediate results. This adaptively tailors the number of trials to the difficulty of the underlying comparison, saving significant time and effort without sacrificing probabilistic correctness. Extensive numerical simulation and real-world robot manipulation experiments show that our test achieves near-optimal stopping, letting researchers stop evaluation and make a decision in a near-minimal number of trials. Specifically, it reduces the number of evaluation trials by up to 40% compared to state-of-the-art baselines, while preserving the probabilistic correctness and statistical power of the comparison. Moreover, our method is strongest in the most challenging comparison instances (those requiring the most evaluation trials); in a multi-task comparison scenario, we save the evaluator more than 200 simulation rollouts.
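To make the contrast between batch and sequential testing concrete, the following is a minimal sketch of a generic anytime-valid sequential test. It is not the test proposed in the paper; it is a textbook mixture test martingale, shown only to illustrate how a sequential test can be checked after every trial without p-hacking. The assumptions are ours: both policies are run in independent paired trials with binary success outcomes, only discordant pairs (one policy succeeds, the other fails) are informative, and the null hypothesis is that the two success rates are equal. The function names `log_mixture_martingale` and `sequential_compare` are hypothetical.

```python
import math


def log_mixture_martingale(wins_a, wins_b):
    """Log of a test martingale for H0: P(A wins a discordant pair) = 1/2.

    Uses a uniform Beta(1, 1) mixture over the alternative, which gives
    M_n = 2^n * B(wins_a + 1, wins_b + 1), where n = wins_a + wins_b.
    """
    n = wins_a + wins_b
    log_beta = (math.lgamma(wins_a + 1) + math.lgamma(wins_b + 1)
                - math.lgamma(n + 2))
    return n * math.log(2.0) + log_beta


def sequential_compare(outcomes_a, outcomes_b, alpha=0.05):
    """Process paired binary outcomes one trial at a time.

    Stops as soon as the martingale reaches 1/alpha. By Ville's inequality,
    the probability of ever stopping under H0 is at most alpha, regardless
    of how many trials are run. Returns (winner or None, trials used).
    """
    threshold = math.log(1.0 / alpha)
    wins_a = wins_b = 0
    trials = 0
    for a, b in zip(outcomes_a, outcomes_b):
        trials += 1
        if a and not b:
            wins_a += 1
        elif b and not a:
            wins_b += 1
        # Concordant pairs (both succeed or both fail) leave the
        # martingale unchanged.
        if log_mixture_martingale(wins_a, wins_b) >= threshold:
            return ("A" if wins_a > wins_b else "B"), trials
    return None, trials  # trial budget exhausted without a decision


if __name__ == "__main__":
    # Extreme illustrative case: policy A always succeeds, B always fails.
    print(sequential_compare([1] * 20, [0] * 20))
```

In this sketch an easy comparison stops after a handful of trials, while a hard one (nearly equal success rates) consumes the whole budget and returns no decision. This simple version only controls the Type I error; it makes no claim to the near-optimal stopping behavior or power guarantees described in the abstract.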