🤖 AI Summary
This work addresses the performance degradation of conventional behavior cloning when test-time environment dynamics diverge from those encountered during training, which can force agents to take unreliable actions. The paper introduces a selective imitation learning framework for arbitrary dynamics shift, in which the agent may stop acting when it cannot act reliably. Given labeled expert demonstrations from the training environment and unlabeled state trajectories from the same expert in the test environment, the proposed algorithm, SeqRejectron, constructs a stopping rule from a small set of validator policies whose size is independent of both the horizon and the policy class. For deterministic policies, the analysis assumes sparse costs; for stochastic policies, it uses a cumulative Hellinger stopping time. The theoretical analysis establishes horizon-free sample complexity guarantees for both policy classes, and extensions to misspecified experts and to expert policies that differ between training and test yield guarantees that degrade gracefully with the amount of misspecification.
📝 Abstract
Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment's transitions differ from those seen in training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative, which can degrade performance arbitrarily. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of both the horizon and the policy class. For deterministic policies, this yields horizon-free $\tilde{O}(\log|\Pi|/\epsilon^2)$ sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and to expert policies that differ between train and test, and obtain guarantees that degrade gracefully with the amount of misspecification.
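To make the idea of a cumulative Hellinger stopping time concrete, the following is a minimal Python sketch, not the paper's SeqRejectron procedure. The abstract does not specify the exact rule, so everything beyond the standard definition of squared Hellinger distance is an assumption made for illustration: the function names, the use of the maximum over validators at each step, and the `threshold` parameter are all hypothetical.

```python
import math
from typing import Optional, Sequence

Dist = Sequence[float]  # action probabilities at a single state


def squared_hellinger(p: Dist, q: Dist) -> float:
    """Squared Hellinger distance between two discrete distributions.

    Uses the identity H^2(p, q) = 1 - sum_i sqrt(p_i * q_i).
    """
    bhattacharyya = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return max(0.0, 1.0 - bhattacharyya)  # clamp tiny negative rounding error


def hellinger_stopping_time(
    learner: Sequence[Dist],
    validators: Sequence[Sequence[Dist]],
    threshold: float,
) -> Optional[int]:
    """Return the first step at which accumulated disagreement exceeds threshold.

    learner[t] is the learner's action distribution at step t, and
    validators[k][t] is validator k's distribution at the same state.
    Taking the max over validators is an illustrative assumption,
    not a rule stated in the abstract.
    """
    total = 0.0
    for t, p in enumerate(learner):
        total += max(squared_hellinger(p, v[t]) for v in validators)
        if total > threshold:
            return t  # stop here
    return None  # never stops on this trajectory


# Usage: one validator that agrees for two steps, then fully disagrees.
learner = [[0.5, 0.5], [1.0, 0.0], [1.0, 0.0]]
validators = [[[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]]
stop_step = hellinger_stopping_time(learner, validators, threshold=0.5)
```

For deterministic policies each action distribution is a point mass, so the squared Hellinger distance at a step is either 0 or 1, and the accumulated quantity in this sketch reduces to a count of the steps at which some validator disagrees with the learner.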