Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation

📅 2025-11-09
🤖 AI Summary
In open-vocabulary mobile manipulation (OVMM), poor robot base placement frequently leads to task failure because navigation relies on local geometric proximity alone. Method: This paper proposes a semantic-geometric co-guided “coarse-to-fine” exploration framework that jointly models vision-language model (VLM) semantic priors and geometric feasibility, departing from conventional geometry-only approaches. It constructs cross-modal representations, an Affordance RGB map for semantics-aware global search and an Obstacle Map+ for geometric reasoning, enabling VLM-informed navigation without task-specific training. Geometry-constrained iterative refinement then achieves task-adaptive, zero-shot base placement. Contribution/Results: Evaluated on five OVMM tasks, the method achieves an 85% success rate, substantially outperforming pure geometric planners and standard VLM baselines, demonstrating the critical role of semantic awareness and joint multimodal reasoning in generalizable, instruction-driven manipulation planning.
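
The summary only names the Affordance RGB representation without specifying how it is built. As a rough illustration of the idea, the following minimal Python sketch blends a per-cell task-relevance heatmap (standing in for the VLM-derived affordance scores) into a top-down RGB map; the `build_affordance_rgb` helper, the 50% red-channel blend, and all map shapes are our assumptions, not the paper's implementation.

```python
import numpy as np

def build_affordance_rgb(topdown_rgb, affordance_scores):
    """Hypothetical sketch: fuse VLM-derived per-cell task-relevance scores
    (H x W, in [0, 1]) into a top-down RGB map (H x W x 3, uint8) so that
    semantics and spatial context live in a single image a VLM can inspect.
    The 50% red-channel blend is an assumption, not the paper's method."""
    heat = np.clip(affordance_scores, 0.0, 1.0)[..., None]   # H x W x 1
    red = np.zeros_like(topdown_rgb, dtype=float)
    red[..., 0] = 255.0                                      # pure-red overlay
    blended = (1.0 - 0.5 * heat) * topdown_rgb + 0.5 * heat * red
    return blended.astype(np.uint8)

# Toy usage: one highly task-relevant cell shifts toward red.
rgb = np.full((10, 10, 3), 128, dtype=np.uint8)
scores = np.zeros((10, 10))
scores[3, 7] = 1.0
out = build_affordance_rgb(rgb, scores)
print(out[3, 7], out[0, 0])   # [191  64  64] vs. [128 128 128]
```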

📝 Abstract
In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to steer the search toward promising regions and then refine placements under geometric constraints, reducing the risk of convergence to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM.
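
The abstract does not spell out how the coarse stage combines the two maps, so here is a minimal, hypothetical Python sketch of one plausible reading: semantic scores gated by geometric feasibility, with the top-scoring free cells returned as candidate base regions. The `coarse_search` helper and the multiplicative score are assumptions for illustration only.

```python
import numpy as np

def coarse_search(affordance_map, obstacle_map, k=5):
    """Hypothetical coarse stage: rank candidate base cells by a joint
    semantic-geometric score. `affordance_map` stands in for Affordance RGB
    reduced to per-cell relevance in [0, 1]; `obstacle_map` stands in for
    Obstacle Map+ (1 = blocked, 0 = free). The product score is assumed."""
    feasible = (obstacle_map == 0).astype(float)
    score = affordance_map * feasible            # semantics gated by geometry
    top = np.argsort(score.ravel())[::-1][:k]    # indices of top-k cells
    return [np.unravel_index(i, score.shape) for i in top]

# Toy maps: an affordance peak near (3, 7), a wall along column 5.
rows, cols = np.mgrid[0:10, 0:10]
aff = np.exp(-0.2 * ((rows - 3) ** 2 + (cols - 7) ** 2))
obs = np.zeros((10, 10))
obs[:, 5] = 1.0
print(coarse_search(aff, obs))   # candidates cluster around (3, 7)
```
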
Problem

Research questions and friction points this paper is trying to address.

Selecting optimal robot base placement for manipulation tasks
Overcoming limitations of proximity-based navigation without affordances
Integrating semantic understanding with geometric feasibility for planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine exploration integrating semantic and geometric reasoning
Cross-modal representations aligning affordance semantics with spatial context
Iterative optimization guided by affordance priors and geometric constraints (a minimal sketch follows this list)
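
The refinement step is likewise only named, so the sketch below illustrates one plausible reading: greedy hill-climbing over neighboring base cells, accepting moves only while they improve a task score and remain geometrically feasible. `score_fn`, `feasible_fn`, and the greedy neighborhood ascent are hypothetical stand-ins for the paper's geometric constraints and optimizer.

```python
def refine_placement(start, score_fn, feasible_fn, iters=20):
    """Hypothetical fine stage: greedy local refinement of a coarse base
    cell. Moves to the best feasible 8-neighbor while the score improves;
    stops at a local optimum under the geometric constraints."""
    current = start
    for _ in range(iters):
        r, c = current
        neighbors = [(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)]
        candidates = [p for p in neighbors if feasible_fn(p)]
        if not candidates:
            break
        best = max(candidates, key=score_fn)
        if score_fn(best) <= score_fn(current):
            break
        current = best
    return current

# Toy usage: score peaks at (3, 7); cells in column 5 are infeasible.
score = lambda p: -((p[0] - 3) ** 2 + (p[1] - 7) ** 2)
feasible = lambda p: 0 <= p[0] < 10 and 0 <= p[1] < 10 and p[1] != 5
print(refine_placement((6, 9), score, feasible))   # converges to (3, 7)
```
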
Authors

Tzu-Jung Lin, National Taiwan University
Jia-Fong Yeh, PhD at National Taiwan University (Robot Learning, Reinforcement Learning, Computer Vision, Evolutionary Computation)
Hung-Ting Su, National Taiwan University (Natural Language Processing, Computer Vision, Machine Learning, Multimedia)
Chung-Yi Lin, National Taiwan University
Yi-Ting Chen, National Taiwan University and National Yang Ming Chiao Tung University
Winston H. Hsu, National Taiwan University