🤖 AI Summary
Interactive imitation learning (IL) suffers from compounding errors and low sample efficiency when expert policies are unrealizable—e.g., due to state/action space mismatches arising from morphological disparities between human demonstrators and robotic agents.
Method: We introduce “reward-agnostic policy completeness,” a novel structural condition that, for the first time under unrealizable settings, theoretically guarantees avoidance of compounding error in interactive IL. Building on this, we propose a hybrid optimization framework integrating limited expert demonstrations with auxiliary offline data to improve sample efficiency.
Results: Evaluated on continuous-control benchmarks, our method significantly outperforms offline baselines such as behavioral cloning. Moreover, it provides the first systematic empirical evidence that the choice of reset distribution critically governs performance under misspecification, highlighting its pivotal role in robust interactive IL.
📝 Abstract
Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert's policy lies within the learner's policy class (i.e., the learner can perfectly imitate the expert in all states). However, in practice, perfect imitation of the expert is often impossible due to differences in state information and action space expressiveness (e.g., morphological differences between robots and humans). In this paper, we consider the more general misspecified setting, where no assumptions are made about the expert policy's realizability. We introduce a novel structural condition, reward-agnostic policy completeness, and prove that it is sufficient for interactive IL algorithms to efficiently avoid the quadratically compounding errors that stymie offline approaches like behavioral cloning. We address an additional practical constraint, the case of limited expert data, and propose a principled method for using additional offline data to further improve the sample efficiency of interactive IL algorithms. Finally, we empirically investigate the optimal reset distribution in efficient IL under misspecification with a suite of continuous control tasks.
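To make the compounding-error contrast concrete, here is a small self-contained Python simulation of the classic intuition (an illustration only, not the paper's construction; the names `rollout_cost` and `mean_cost` and the toy dynamics are hypothetical). A learner that errs with per-step probability eps and never recovers once off the expert's state distribution, as in offline behavioral cloning, accumulates cost that grows roughly quadratically with the horizon H; a learner that was trained interactively on its own induced distribution recovers after each mistake and pays only roughly eps * H.

```python
import random

def rollout_cost(horizon, eps, recovers, rng):
    """Simulate one episode; return total cost incurred in off-distribution states."""
    bad, cost = False, 0  # bad = currently off the expert's state distribution
    for _ in range(horizon):
        if bad:
            cost += 1          # every off-distribution step incurs unit cost
            if recovers:       # interactive learner saw corrections in these states
                bad = False
        elif rng.random() < eps:
            bad = True         # per-step imitation error with probability eps
    return cost

def mean_cost(horizon, eps, recovers, trials=20000, seed=0):
    """Monte Carlo estimate of expected episode cost."""
    rng = random.Random(seed)
    return sum(rollout_cost(horizon, eps, recovers, rng)
               for _ in range(trials)) / trials
```

With eps = 0.01, doubling the horizon from 50 to 100 more than triples the non-recovering learner's expected cost, while the recovering learner's cost only roughly doubles, mirroring the quadratic-versus-linear separation the abstract refers to.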