A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Behavior cloning (BC) recovers poorly from out-of-distribution states, struggling with failure scenarios that deviate from the expert demonstrations. This paper introduces SAILOR, a learning-to-search (L2S) framework for visual imitation learning that jointly learns a world model and a reward model, enabling the agent to perform online goal-conditioned planning at test time for robust recovery without additional supervision. SAILOR removes the need for human interventions or corrective labels, elevating imitation learning from "giving the agent a fish" to "teaching it how to fish." Evaluated across 12 visual manipulation tasks from three benchmarks, SAILOR significantly outperforms diffusion-based BC policies, and this advantage persists even when the BC training data is scaled up by 5–10×. SAILOR also provides explicit failure detection and is robust to reward hacking.
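The test-time planning loop the summary describes (roll candidate action sequences through a learned world model, score the imagined outcomes with a learned reward model, act on the best one) can be sketched as a minimal random-shooting planner. Everything below is an illustrative assumption — toy dynamics, a toy goal-distance reward, and hypothetical function names — not SAILOR's actual models or implementation:

```python
import random

random.seed(0)

# Hypothetical stand-ins for the two learned components (assumptions for
# illustration, not the paper's networks): a world model that predicts
# the next state under an action, and a reward model that scores how
# closely a state matches the expert's outcome (the goal).
def world_model(state, action):
    # Toy linear dynamics: the action directly displaces the state.
    return [s + a for s, a in zip(state, action)]

def reward_model(state, goal):
    # Negative squared distance: higher reward nearer the goal.
    return -sum((s - g) ** 2 for s, g in zip(state, goal))

def plan(state, goal, horizon=5, n_candidates=256):
    """Random-shooting planner: sample candidate action sequences, roll
    each through the world model, sum the reward model's scores along
    the imagined trajectory, and return the first action of the best
    sequence."""
    best_score, best_action = float("-inf"), None
    for _ in range(n_candidates):
        actions = [[random.uniform(-1.0, 1.0) for _ in state]
                   for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)
            total += reward_model(s, goal)
        if total > best_score:
            best_score, best_action = total, actions[0]
    return best_action  # execute it, observe, then replan (MPC-style)
```

Because the planner optimizes toward the expert's *outcome* rather than mimicking per-state actions, this loop can in principle steer back toward the goal even from states no demonstration visited — the recovery behavior BC lacks.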

📝 Abstract
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish -- giving them dense supervision across a narrow set of states -- rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR.
Problem

Research questions and friction points this paper is trying to address.

Overcoming BC's limitation in handling unseen states
Learning recovery behavior without human corrections
Improving robustness and performance over Diffusion Policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning to search from expert demonstrations
Combining world model and reward model
Robust imitation without human corrections