🤖 AI Summary
Mobile manipulation in realistic home environments must jointly execute long-range navigation and fine-grained manipulation, and existing methods suffer from low sample efficiency and poor generalization. To address these challenges, this paper proposes an end-to-end imitation learning framework that integrates whole-body motion control with a hybrid action space. Key contributions include: (1) a novel kinematics-based whole-body controller enabling dynamic switching between absolute and relative pose control; (2) zero-shot task generalization achieved via integration with vision-language models; and (3) empirical validation on a hardware platform combining a 7-DoF manipulator with an omnidirectional mobile base. With only 20 demonstration trajectories per task, the method achieves a mean success rate of 79.17% across three real-world household tasks, outperforming the best baseline by 29.17 percentage points, and demonstrates significantly improved robustness to cluttered scenes and previously unseen object appearances.
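The kinematics-based whole-body controller described above maps a desired end-effector pose to coordinated base-and-arm motion. A common way to realize such a controller is damped-least-squares differential inverse kinematics over a combined Jacobian; the sketch below illustrates this standard technique under assumed shapes (3-DoF holonomic base, 7-DoF arm), and is not the paper's actual implementation.

```python
import numpy as np

def whole_body_dls_step(J_base, J_arm, ee_err, damping=1e-2):
    """One damped-least-squares step of a kinematics-style whole-body
    controller: map a 6-D end-effector twist error to coordinated
    velocities for a holonomic base (x, y, yaw) and a 7-DoF arm.

    J_base: (6, 3) base Jacobian in the world frame (assumed given).
    J_arm:  (6, 7) arm Jacobian in the world frame (assumed given).
    ee_err: (6,)   end-effector twist error (linear + angular).
    """
    J = np.hstack([J_base, J_arm])             # (6, 10) combined Jacobian
    # Damped pseudo-inverse J^T (J J^T + lambda^2 I)^-1:
    # robust near kinematic singularities.
    JJt = J @ J.T + damping**2 * np.eye(6)
    qdot = J.T @ np.linalg.solve(JJt, ee_err)  # (10,) = base(3) + arm(7)
    return qdot[:3], qdot[3:]
```

Because the base and arm share one Jacobian, the solver distributes the motion across both automatically, which is what lets the learned policy act purely in end-effector space.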
📝 Abstract
We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17 percentage points on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: http://homer-manip.github.io
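The hybrid action modes in the abstract amount to a small dispatch at execution time: an absolute action sets the end-effector target outright (long-range movement), while a relative action perturbs the current pose (fine-grained manipulation). A minimal illustrative sketch, with hypothetical names and a simple 6-D pose vector rather than the paper's actual action parameterization:

```python
import numpy as np

def apply_hybrid_action(current_pose, mode, pose):
    """Resolve one hybrid action into an end-effector pose target.

    current_pose: (6,) current end-effector pose (x, y, z, r, p, y).
    mode: "absolute" (long-range) or "relative" (fine-grained).
    pose: (6,) target pose, or a pose delta when mode == "relative".

    Illustrative only: real pose deltas would compose rotations
    properly rather than adding Euler angles componentwise.
    """
    current_pose = np.asarray(current_pose, dtype=float)
    pose = np.asarray(pose, dtype=float)
    if mode == "absolute":
        return pose                   # replace the target entirely
    if mode == "relative":
        return current_pose + pose    # nudge from the current pose
    raise ValueError(f"unknown action mode: {mode}")
```

The policy predicts the mode flag along with the pose, so the same network output covers both coarse navigation and precise manipulation phases of a task.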