Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mobile manipulators require precise, centimeter-level base positioning before manipulation, yet existing RGB-based navigation achieves only meter-level accuracy, leaving downstream manipulation policies outside the distribution of their training demonstrations. To address this, we propose the first language-driven, monocular RGB-only, category-level last-meter navigation framework, requiring no depth or LiDAR sensors, geometric maps, or pre-built environment models, and enabling cross-object and cross-environment generalization from demonstrations of a single object instance. Our method integrates object-centric imitation learning, language-guided segmentation, spatial score-matrix decoding, and multi-view RGB fusion, with the navigation policy conditioned on textual prompts. Real-world experiments demonstrate 73.47% edge-alignment success and 96.94% object-alignment success on unseen target objects, the first empirical validation of purely vision-based, category-level, centimeter-accurate positioning.
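
To make the described pipeline concrete, the sketch below shows one plausible way the named components (language-guided segmentation, multi-view RGB fusion, and spatial score-matrix decoding) could be wired into a single inference step. Every function name, grid resolution, and shape here is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of an RGB-only, language-conditioned last-meter policy step.
# All module names, shapes, and grid resolutions are illustrative assumptions.
import numpy as np

def segment_with_prompt(rgb: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for a language-driven segmentation model: returns a binary mask
    for the object named in `prompt` (placeholder output here)."""
    h, w, _ = rgb.shape
    return np.zeros((h, w), dtype=bool)

def fuse_views(masked_views: list[np.ndarray]) -> np.ndarray:
    """Stand-in for multi-view RGB fusion: one toy feature vector per camera."""
    return np.stack([v.mean(axis=(0, 1)) for v in masked_views])  # (num_views, 3)

def decode_score_matrix(fused: np.ndarray, goal_feat: np.ndarray) -> np.ndarray:
    """Stand-in for the spatial score-matrix decoder: scores a discretized grid of
    candidate base offsets (dx, dy, dtheta)."""
    grid = np.zeros((21, 21, 12))  # 21x21 planar offsets x 12 heading bins (toy)
    grid[10, 12, 3] = 1.0          # pretend the decoder peaks at one cell
    return grid

def last_meter_step(views: list[np.ndarray], goal_img: np.ndarray, prompt: str):
    """One policy step: ground the object, fuse views, decode a relative base target."""
    masks = [segment_with_prompt(v, prompt) for v in views]
    masked = [v * m[..., None] for v, m in zip(views, masks)]      # object-centric crop
    fused = fuse_views(masked)
    goal_feat = goal_img.mean(axis=(0, 1))
    scores = decode_score_matrix(fused, goal_feat)
    ix, iy, ith = np.unravel_index(scores.argmax(), scores.shape)  # best-scoring cell
    dx = (ix - 10) * 0.05          # 5 cm grid spacing (assumed)
    dy = (iy - 10) * 0.05
    dtheta = (ith - 6) * np.pi / 12
    return dx, dy, dtheta          # relative base motion toward manipulation-ready pose

if __name__ == "__main__":
    views = [np.random.rand(240, 320, 3) for _ in range(3)]        # three onboard cameras
    goal = np.random.rand(240, 320, 3)
    print(last_meter_step(views, goal, "the blue recycling bin"))
```

In this reading, the score matrix is a discretized grid over candidate base offsets and decoding simply takes the highest-scoring cell; the decoder in the paper may well be learned end-to-end rather than hand-specified as above.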

📝 Abstract
Achieving precise positioning of the mobile manipulator's base is essential for successful manipulation actions that follow. Most RGB-based navigation systems guarantee only coarse, meter-level accuracy, making them less suitable for the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To evaluate this comprehensively, we introduce two metrics: an edge-alignment metric, which uses ground-truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 73.47% success in edge-alignment and 96.94% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at the category level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: https://rpm-lab-umn.github.io/category-level-last-meter-nav/
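
Since the abstract defines the two success metrics only informally, here is a minimal sketch of how they could be computed. The tolerances, frame conventions, and image-centering test are assumptions for illustration, not values from the paper.

```python
# Illustrative sketch of the two reported metrics; thresholds are assumed values.
import numpy as np

def edge_alignment_success(robot_yaw: float, gt_edge_yaw: float,
                           robot_xy: np.ndarray, gt_xy: np.ndarray,
                           yaw_tol: float = np.deg2rad(15), pos_tol: float = 0.10) -> bool:
    """Edge alignment: compare the final base pose against a ground-truth pose
    aligned with the target object's edge (tolerances here are assumptions)."""
    yaw_err = np.abs(np.arctan2(np.sin(robot_yaw - gt_edge_yaw),
                                np.cos(robot_yaw - gt_edge_yaw)))   # wrapped angle error
    pos_err = np.linalg.norm(robot_xy - gt_xy)
    return yaw_err <= yaw_tol and pos_err <= pos_tol

def object_alignment_success(bbox_center_u: float, image_width: int,
                             center_tol_frac: float = 0.15) -> bool:
    """Object alignment: does the robot visually face the target, i.e. is the
    detected object roughly centered in the forward camera image?"""
    offset = abs(bbox_center_u - image_width / 2) / image_width
    return offset <= center_tol_frac

if __name__ == "__main__":
    print(edge_alignment_success(0.05, 0.0, np.array([1.0, 0.02]), np.array([1.0, 0.0])))
    print(object_alignment_success(bbox_center_u=170, image_width=320))
```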
Problem

Research questions and friction points this paper is trying to address.

Achieving precise robot positioning from RGB camera observations alone
Last-meter navigation for manipulation without depth sensors or LiDAR
Generalizing across object instances within a category from demonstrations of a single instance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric imitation learning for last-meter navigation (a minimal training sketch follows this list)
Language-driven segmentation and spatial score-matrix decoder
Generalizes from a single instance to category level using only RGB
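
Reading the object-centric imitation learning bullet as behavior cloning on language-grounded, object-centric observations, a minimal training sketch might look like the following. The network, loss, observation format, and action parameterization are all assumptions for illustration, not the paper's architecture.

```python
# Minimal behavior-cloning sketch for the object-centric imitation idea:
# demonstrations of ONE object instance supervise a policy that only sees
# masked, object-centric RGB, which is what plausibly lets it transfer to
# unseen instances. All design choices below are assumptions.
import torch
import torch.nn as nn

class LastMeterPolicy(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(                      # toy CNN over masked RGB
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(2 * feat_dim, 3)             # (dx, dy, dtheta) base action

    def forward(self, masked_obs: torch.Tensor, goal_img: torch.Tensor) -> torch.Tensor:
        obs_feat = self.encoder(masked_obs)
        goal_feat = self.encoder(goal_img)
        return self.head(torch.cat([obs_feat, goal_feat], dim=-1))

def train_step(policy, optimizer, batch):
    """One imitation step on (masked observation, goal image, expert base action)."""
    masked_obs, goal_img, expert_action = batch
    pred = policy(masked_obs, goal_img)
    loss = nn.functional.mse_loss(pred, expert_action)     # simple behavior-cloning loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    policy = LastMeterPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    batch = (torch.rand(8, 3, 96, 96), torch.rand(8, 3, 96, 96), torch.rand(8, 3))
    print(train_step(policy, opt, batch))
```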
Tzu-Hsien Lee
University of Minnesota, Twin Cities
Fidan Mahmudova
University of Minnesota, Twin Cities
Karthik Desingh
Assistant Professor, University of Minnesota
Robotics · Computer Vision · Machine Learning