DeMoBot: Deformable Mobile Manipulation with Vision-based Sub-goal Retrieval

📅 2024-08-28

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

To address the weak generalization of few-shot imitation learning (20 demonstrations) in partially observable environments, this paper proposes DeMoBot, a demonstration-driven framework for deformable mobile manipulation. The method retrieves demonstration snippets using features from a vision foundation model (e.g., CLIP), matching the current observation to expert data by visual similarity; applies trajectory-similarity and forward-reachability constraints to select feasible sub-goals; and uses a goal-conditioned, diffusion-based motion policy to generate actions. Rather than fitting an end-to-end policy, the approach is modular, which makes sub-goal selection interpretable and checkable for feasibility. Evaluated on a Spot robot in simulation and in the real world, DeMoBot outperforms the baselines, with success rates of 85%/80% (simulation/real) for gap covering, 87.5%/70% for table uncovering, and 47.5%/35% for curtain opening.

📝 Abstract

Imitation learning (IL) algorithms typically distil experience into parametric behavior policies to mimic expert demonstrations. With limited experience, previous methods often struggle to accurately align the current state with expert demonstrations, particularly in tasks that are characterised by partial observations or dynamic object deformations. We consider imitation learning in deformable mobile manipulation with an ego-centric limited field of view and introduce a novel IL approach called DeMoBot that directly retrieves observations from demonstrations. DeMoBot utilizes vision foundation models to identify relevant expert data based on visual similarity and matches the current trajectory with demonstrated trajectories using trajectory similarity and forward reachability constraints to select suitable sub-goals. A goal-conditioned motion generation policy then guides the robot to the sub-goal until the task is completed. We evaluate DeMoBot using a Spot robot in several simulated and real-world settings, demonstrating its effectiveness and generalizability. DeMoBot outperforms baselines with only 20 demonstrations, attaining high success rates in gap covering (85% simulation, 80% real-world) and table uncovering (87.5% simulation, 70% real-world), while showing promise in complex tasks like curtain opening (47.5% simulation, 35% real-world). Additional details are available at: https://sites.google.com/view/demobot-fewshot/home
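
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes observation embeddings have already been computed by some vision foundation model, uses cosine similarity as the visual-similarity measure, and reduces the trajectory-similarity and forward-reachability constraints to a simple rule (stay in the previously matched demonstration and never move backward in time). The function names, the `horizon` parameter, and the fallback behavior are all hypothetical.

```python
import numpy as np


def cosine_similarity(query, bank):
    """Cosine similarity between one query vector and every row of `bank`."""
    query = query / np.linalg.norm(query)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return bank @ query


def retrieve_subgoal(query_feat, demo_feats, demo_ids, demo_steps,
                     last_match=None, horizon=5):
    """Pick a sub-goal frame from a bank of demonstration frames.

    query_feat : (D,) embedding of the current observation (assumed precomputed).
    demo_feats : (N, D) embeddings of all demonstration frames.
    demo_ids   : (N,) index of the demonstration each frame belongs to.
    demo_steps : (N,) time step of each frame within its demonstration.
    last_match : optional (demo_id, step) of the previous match. As a crude
                 stand-in for the trajectory and forward-reachability
                 constraints, candidates must come from the same
                 demonstration and must not lie earlier in time.
    horizon    : how many steps ahead of the matched frame the sub-goal lies
                 (hypothetical parameter).
    Returns (demo_id, subgoal_step).
    """
    sims = cosine_similarity(query_feat, demo_feats)

    feasible = np.ones(len(sims), dtype=bool)
    if last_match is not None:
        prev_demo, prev_step = last_match
        feasible = (demo_ids == prev_demo) & (demo_steps >= prev_step)
    if not feasible.any():
        # Assumption: fall back to unconstrained retrieval if nothing is feasible.
        feasible[:] = True

    # Most visually similar frame among the feasible candidates.
    best = int(np.argmax(np.where(feasible, sims, -np.inf)))
    demo = demo_ids[best]

    # Sub-goal: a frame `horizon` steps ahead, clipped to the end of that demo.
    last_step = demo_steps[demo_ids == demo].max()
    subgoal_step = min(demo_steps[best] + horizon, last_step)
    return int(demo), int(subgoal_step)
```

In a closed loop, the returned frame would be handed to the goal-conditioned motion policy as its target, and the matched `(demo_id, step)` pair would be passed back in as `last_match` on the next retrieval so that the selected sub-goals keep progressing through the task.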

Problem

Research questions and friction points this paper is trying to address.

Addresses few-shot imitation learning for mobile manipulation tasks

Mitigates partial observability from an ego-centric, limited field of view in vision-based robot control

Enables generalization across positions, sizes, and material types

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-based imitation learning from few demonstrations

Vision foundation models for visual similarity assessment

Goal-conditioned motion generation policy that guides the robot to the selected sub-goal

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey