🤖 AI Summary
Behavior cloning (BC) exhibits poor recovery in out-of-distribution states, struggling with failure scenarios that deviate from expert demonstrations. This paper explores learning to search (L2S) for visual imitation learning: jointly learning a world model and a reward model so that, at test time, the agent can plan online toward the expert's outcome and recover robustly without additional supervision. The resulting method, SAILOR, eliminates reliance on extra human interventions or dense corrections, elevating imitation learning from “giving a fish” to “teaching how to fish.” Evaluated across 12 visual manipulation tasks on three benchmarks, SAILOR significantly outperforms diffusion-based BC policies, and this advantage persists even when BC training data is scaled by 5–10×. Moreover, SAILOR can identify nuanced failures and is robust to reward hacking.
📝 Abstract
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake that takes it out of the support of the demonstrations, it often doesn't know how to recover. In this sense, BC is akin to giving the agent a fish -- dense supervision across a narrow set of states -- rather than teaching it to fish: to reason independently about achieving the expert's outcome even when faced with unseen situations at test time. In response, we explore learning to search (L2S) from expert demonstrations, i.e., learning the components required to plan, at test time, to match expert outcomes even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR.
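To make the L2S recipe concrete, here is a minimal sketch of test-time planning with a learned world model and reward model. Everything below is illustrative, not the paper's implementation: the toy `world_model`, `reward_model`, goal state, and random-shooting planner are all assumptions standing in for the learned components described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned components: a world model that
# predicts the next (latent) state, and a reward model that scores states
# against the expert's outcome. Real versions would be neural networks.
GOAL = np.array([1.0, 1.0])  # assumed target state (illustrative)

def world_model(state, action):
    # Toy dynamics: the action nudges the state directly.
    return state + 0.1 * action

def reward_model(state):
    # Toy reward: negative distance to the expert's outcome.
    return -np.linalg.norm(state - GOAL)

def plan(state, horizon=5, n_candidates=256):
    """Random-shooting planner: sample candidate action sequences, roll
    each out through the world model, score the imagined trajectory with
    the reward model, and return the first action of the best sequence."""
    best_score, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        s, score = state, 0.0
        for a in seq:
            s = world_model(s, a)
            score += reward_model(s)
        if score > best_score:
            best_score, best_first_action = score, seq[0]
    return best_first_action

# Closed-loop recovery: start from an off-distribution "mistake" state
# and replan every step, driving the state toward the expert's outcome.
state = np.array([-2.0, 3.0])
for _ in range(50):
    state = world_model(state, plan(state))
print(np.linalg.norm(state - GOAL))  # distance to goal after replanning
```

The key point is that recovery falls out of replanning: no demonstration ever visits the mistake state, yet the planner still finds actions that reduce the reward model's notion of distance to the expert's outcome.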