🤖 AI Summary
In mobile manipulation, navigation and manipulation policies are misaligned on initial pose selection: navigation aims merely to reach the task region, whereas manipulation is highly sensitive to the starting pose. To address this, the authors propose N2M, a lightweight, egocentric transition module that, once the robot reaches the task region, infers an operation-friendly initial pose in real time and guides the robot to it. The module operates on a single-frame visual observation, with no global localization, temporal history, or explicit geometric modeling, and learns pose preferences from rollout data. This design supports adaptation to environmental changes, generalization across tasks and platforms, and high data efficiency. On the PnPCounterToCab task, the average success rate rises from 3% with a reachability-based baseline to 54%. On the Toybox Handover task, the module gives reliable predictions in unseen environments after training on only 15 data samples. The method is evaluated both in simulation and on real-world robotic platforms.
📝 Abstract
In mobile manipulation, the manipulation policy strongly prefers certain initial poses from which to execute. However, the navigation module focuses solely on reaching the task area, without considering which initial pose is preferable for downstream manipulation. To address this misalignment, we introduce N2M, a transition module that guides the robot to a preferable initial pose after it reaches the task area, thereby substantially improving task success rates. N2M offers five key advantages: (1) reliance solely on ego-centric observation, without requiring global or historical information; (2) real-time adaptation to environmental changes; (3) reliable prediction with high viewpoint robustness; (4) broad applicability across diverse tasks, manipulation policies, and robot hardware; and (5) remarkable data efficiency and generalizability. We demonstrate the effectiveness of N2M through extensive simulation and real-world experiments. In the PnPCounterToCab task, N2M improves the average success rate from 3% with the reachability-based baseline to 54%. Furthermore, in the Toybox Handover task, N2M provides reliable predictions even in unseen environments with only 15 data samples.
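To make the "transition" step concrete, the following minimal sketch shows one way a pose predicted relative to the robot could be turned into a motion goal. It is an illustration under stated assumptions, not N2M's actual interface: the abstract does not say how poses are represented, so the planar `(x, y, yaw)` representation, the `Pose2D` type, the `compose` helper, and all numeric values are hypothetical. The goal is expressed in a local odometry frame, which does not require a global map.

```python
import math
from typing import NamedTuple


class Pose2D(NamedTuple):
    """Planar pose: position in metres, yaw in radians (assumed representation)."""
    x: float
    y: float
    yaw: float


def compose(base: Pose2D, offset: Pose2D) -> Pose2D:
    """Express `offset`, given in the frame of `base`, in the parent frame of `base`."""
    c, s = math.cos(base.yaw), math.sin(base.yaw)
    x = base.x + c * offset.x - s * offset.y
    y = base.y + s * offset.x + c * offset.y
    # Wrap the summed heading back into (-pi, pi].
    total = base.yaw + offset.yaw
    yaw = math.atan2(math.sin(total), math.cos(total))
    return Pose2D(x, y, yaw)


# Hypothetical pose where navigation stopped, in the local odometry frame:
# at (2.0, 1.0), facing along +y.
robot_in_odom = Pose2D(2.0, 1.0, math.pi / 2)

# Hypothetical output of a transition module, in the robot's own frame:
# 0.5 m forward, 0.2 m to the left, heading unchanged.
predicted_offset = Pose2D(0.5, 0.2, 0.0)

# Goal pose to hand to a local base controller before manipulation starts.
goal = compose(robot_in_odom, predicted_offset)
print(goal)
```

Because the prediction is relative to the robot, the same composition works wherever navigation happens to stop; only the local odometry estimate at that moment is needed.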