🤖 AI Summary
To address the weak generalization of few-shot imitation learning (20 demonstrations) in partially observable environments, this paper proposes DeMoBot, a novel demonstration-retrieval framework for deformable mobile manipulation. The method uses a vision foundation model (e.g., CLIP) to retrieve demonstration snippets whose observations are visually similar to the current one, applies trajectory similarity and forward-reachability constraints to select feasible sub-goals, and uses a goal-conditioned diffusion-based motion policy to generate actions. The core contribution is replacing end-to-end policy fitting with a modular, interpretable, and verifiable architecture that keeps sub-goals feasible and the policy robust. Evaluated with a Spot robot in simulation and in the real world, the framework outperforms the baselines, with success rates of 85%/80% (simulation/real-world) for gap covering, 87.5%/70% for table uncovering, and 47.5%/35% for curtain opening.
📝 Abstract
Imitation learning (IL) algorithms typically distil experience into parametric behavior policies to mimic expert demonstrations. With limited experience, previous methods often struggle to align the current state accurately with expert demonstrations, particularly in tasks characterised by partial observations or dynamic object deformations. We consider imitation learning in deformable mobile manipulation with an ego-centric limited field of view and introduce a novel IL approach called DeMoBot that directly retrieves observations from demonstrations. DeMoBot uses vision foundation models to identify relevant expert data based on visual similarity, and matches the current trajectory with demonstrated trajectories using trajectory similarity and forward reachability constraints to select suitable sub-goals. A goal-conditioned motion generation policy then guides the robot to each sub-goal until the task is completed. We evaluate DeMoBot using a Spot robot in several simulated and real-world settings, demonstrating its effectiveness and generalizability. DeMoBot outperforms baselines with only 20 demonstrations, attaining high success rates in gap covering (85% simulation, 80% real-world) and table uncovering (87.5% simulation, 70% real-world), while showing promise in complex tasks like curtain opening (47.5% simulation, 35% real-world). Additional details are available at: https://sites.google.com/view/demobot-fewshot/home
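The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Frame` type, the function names, the use of cosine similarity, the averaging of similarities over the recent history, and the fixed `lookahead` are all assumptions made for illustration, and plain float lists stand in for features that a vision foundation model would produce.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """One demonstration observation (hypothetical structure)."""
    demo_id: int
    step: int
    embedding: tuple[float, ...]  # stand-in for a vision-foundation-model feature


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, frames: list[Frame], k: int) -> list[Frame]:
    """Return the k demonstration frames most visually similar to the query."""
    return sorted(frames, key=lambda f: cosine(query, f.embedding), reverse=True)[:k]


def select_subgoal(history, frames: list[Frame], k: int = 3, lookahead: int = 1):
    """Pick a sub-goal frame for the current trajectory.

    history: embeddings of recent observations, oldest first; the last
    entry is the current observation. Returns a Frame or None.
    """
    by_key = {(f.demo_id, f.step): f for f in frames}
    best, best_score = None, -math.inf
    for cand in retrieve(history[-1], frames, k):
        # Assumed forward-reachability check: the sub-goal must exist
        # `lookahead` steps later in the same demonstration.
        goal = by_key.get((cand.demo_id, cand.step + lookahead))
        if goal is None:
            continue
        # Assumed trajectory similarity: average similarity between the
        # recent history and the demo frames leading up to the candidate.
        total, n = 0.0, 0
        for back, emb in enumerate(reversed(history)):
            prev = by_key.get((cand.demo_id, cand.step - back))
            if prev is None:
                break
            total += cosine(emb, prev.embedding)
            n += 1
        score = total / n
        if score > best_score:
            best, best_score = goal, score
    return best
```

In this sketch the returned frame would be handed to a goal-conditioned motion policy as its target, and the selection would be repeated as new observations arrive. A real system would likely use a nearest-neighbour index rather than a full sort, and a reachability test richer than a fixed step offset.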