C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately grounding natural language spatial instructions—particularly those involving complex spatial relations (e.g., distance, geometric constraints, inter-object relative positioning)—in visual scenes. We propose a coarse-to-fine two-stage framework: first, a vision-language model (VLM) leverages grid-based visual localization prompts and a propose-verify mechanism to generate an initial region that is both semantically and physically consistent; second, superpixel segmentation and local adaptive refinement enable fine-grained spatial alignment. To our knowledge, this is the first systematic effort to harness VLMs’ spatial reasoning capabilities for spatial grounding. Evaluated on a newly constructed benchmark, our method significantly outperforms five state-of-the-art baselines in both IoU and success rate. Ablation studies confirm the efficacy of each component, and the approach successfully transfers to simulated robotic pick-and-place tasks.

📝 Abstract
Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning -- such as distance, geometry, and inter-object relationships -- while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained region outputs. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximate yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing the VLM's spatial understanding and yielding a physically and semantically valid canonical region (i.e., an ellipse). For the refinement, we locally adapt the region to the surrounding environment without over-relaxing into free space. We construct a new space-grounding benchmark and compare C2F-Space with five state-of-the-art baselines using success rate and intersection-over-union. Our C2F-Space significantly outperforms all baselines. Our ablation study confirms the effectiveness of each module in the two-step process and their synergistic effect in the combined framework. We finally demonstrate the applicability of C2F-Space to simulated robotic pick-and-place tasks.
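The coarse stage described above (grid-based proposals filtered by a validation check) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: `propose_candidates` returns fixed ellipse parameters in place of a real VLM query over a grid-overlaid image, and `validate` performs only a toy physical-consistency check (rejecting ellipses centered on occupied grid cells); the paper's actual prompts and validation criteria are not reproduced.

```python
import numpy as np

# Hypothetical stand-in for the VLM "propose" step: a real system
# would query a VLM with a grid-overlaid image and the instruction;
# here a fixed candidate list plays that role.
def propose_candidates(instruction, grid_shape):
    """Return candidate ellipse parameters (cx, cy, rx, ry) in grid
    coordinates (stubbed, for illustration only)."""
    return [(2, 2, 1.5, 1.0), (6, 2, 1.5, 1.0)]

def validate(candidate, occupancy):
    """Toy "validate" step: reject an ellipse whose center falls on
    an occupied grid cell (a simple physical-consistency check)."""
    cx, cy, _, _ = candidate
    return occupancy[int(cy), int(cx)] == 0

# Toy 8x8 occupancy grid with one occupied cell.
occupancy = np.zeros((8, 8), dtype=int)
occupancy[2, 2] = 1  # cell (col=2, row=2) holds an object

accepted = [c for c in propose_candidates("left of the mug", occupancy.shape)
            if validate(c, occupancy)]
# The proposal centered on the occupied cell is filtered out.
```

The propose-validate loop lets the VLM stay coarse (grid-level) while a cheap geometric check enforces physical plausibility before any fine-grained refinement.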
Problem

Research questions and friction points this paper is trying to address.

Addresses spatial grounding limitations in natural language instructions
Overcomes coarse output issues in vision-language model localization
Refines spatial regions through superpixel-based environmental alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine framework with grid-based VLM prompting
Superpixelization refines regions to local environment
Propose-validate strategy ensures spatial semantic consistency
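The refinement idea in the bullets above (snapping a coarse canonical region to superpixel boundaries) can be sketched as follows. This is an assumption-laden illustration, not the paper's algorithm: a synthetic block-grid label map stands in for a real superpixel segmentation (e.g., SLIC output), and each superpixel is kept only if its overlap with the coarse ellipse exceeds a threshold `tau` (a hypothetical parameter).

```python
import numpy as np

def ellipse_mask(h, w, cx, cy, rx, ry):
    """Rasterize an axis-aligned ellipse as a boolean mask."""
    ys, xs = np.mgrid[0:h, 0:w]
    return ((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2 <= 1.0

def snap_to_superpixels(coarse, labels, tau=0.5):
    """Keep every superpixel whose overlap with the coarse region
    exceeds tau; drop the rest. Returns the refined boolean mask."""
    refined = np.zeros_like(coarse)
    for sp in np.unique(labels):
        cell = labels == sp
        if coarse[cell].mean() > tau:
            refined |= cell
    return refined

# Toy label map: a 4x4 grid of 8x8 blocks stands in for real
# superpixels; any integer label image would work the same way.
h = w = 32
labels = (np.arange(h)[:, None] // 8) * 4 + (np.arange(w)[None, :] // 8)

coarse = ellipse_mask(h, w, cx=15.5, cy=15.5, rx=10, ry=6)
refined = snap_to_superpixels(coarse, labels, tau=0.5)
```

Snapping to superpixels aligns the region with local image structure, while the overlap threshold keeps it from over-relaxing into free space.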
Nayoung Oh
Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea.
Dohyun Kim
Yonsei University College of Dentistry
Restorative Dentistry · Dental Materials · Dental Hard Tissues · Traumatic Dental Injuries
Junhyeong Bang
Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea.
Rohan Paul
Indian Institute of Technology Delhi, Hauz Khas, India.
Daehyung Park
Associate Professor, KAIST
robotics · manipulation · machine learning