C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models

📅 2025-11-19

🤖 AI Summary
This work addresses the challenge of accurately grounding natural language spatial instructions in visual scenes, particularly instructions involving complex spatial relations (e.g., distance, geometric constraints, inter-object relative positioning). We propose a coarse-to-fine two-stage framework. First, a vision-language model (VLM) uses a grid-based visual-grounding prompt and a propose-validate mechanism to generate an initial region that is both semantically and physically consistent. Second, superpixel segmentation and local adaptive refinement align that region with the fine-grained structure of the scene. To our knowledge, this is the first systematic effort to harness VLMs' spatial reasoning capabilities for space grounding. Evaluated on a newly constructed benchmark, our method significantly outperforms five state-of-the-art baselines in both IoU and success rate. Ablation studies confirm the efficacy of each component, and the approach successfully transfers to simulated robotic pick-and-place tasks.
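The propose-and-check loop over a labeled grid described above might be sketched as follows. This is a minimal illustration and not the paper's implementation: the cell-naming scheme, the `propose` and `validate` callables, the `blocked` set, and the round budget are all hypothetical stand-ins, and a scripted proposer takes the place of a real VLM call.

```python
import string


def grid_labels(rows, cols):
    """Label grid cells like 'A1', 'B3' (letter = row, number = column)."""
    return {
        f"{string.ascii_uppercase[r]}{c + 1}": (r, c)
        for r in range(rows)
        for c in range(cols)
    }


def cell_center(cell, cell_h, cell_w):
    """Pixel coordinates (y, x) of the center of a grid cell."""
    r, c = cell
    return ((r + 0.5) * cell_h, (c + 0.5) * cell_w)


def propose_validate(propose, validate, max_rounds=3):
    """Ask for a proposal, check it, and feed failures back until one passes."""
    feedback = None
    for _ in range(max_rounds):
        candidate = propose(feedback)
        ok, feedback = validate(candidate)
        if ok:
            return candidate
    return None  # no valid proposal within the round budget


# Stand-ins for the VLM (assumptions): a scripted proposer and a rule-based validator.
labels = grid_labels(4, 4)
blocked = {"B2"}  # hypothetical cell occupied by an object
script = iter(["B2", "C3"])


def propose(feedback):
    # A real system would send the gridded image and the feedback to a VLM here.
    return next(script)


def validate(label):
    if label not in labels:
        return False, f"{label} is not a grid cell"
    if label in blocked:
        return False, f"{label} is occupied"
    return True, None


chosen = propose_validate(propose, validate)
center = cell_center(labels[chosen], cell_h=100, cell_w=100)
```

In this toy run the first proposal lands on an occupied cell and is rejected, and the second one passes, so the loop returns its label and the cell center can seed a coarse region.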

📝 Abstract
Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning (such as distance, geometry, and inter-object relationships), while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained output regions. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximate yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing the VLM's spatial understanding and yielding physically and semantically valid canonical regions (i.e., ellipses). For the refinement, we locally adapt the region to the surrounding environment without over-relaxing it into free space. We construct a new space-grounding benchmark and compare C2F-Space with five state-of-the-art baselines using success rate and intersection-over-union. C2F-Space significantly outperforms all baselines. Our ablation study confirms the effectiveness of each module in the two-step process and the synergistic effect of the combined framework. Finally, we demonstrate the applicability of C2F-Space to simulated robotic pick-and-place tasks.
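For readers unfamiliar with the two evaluation metrics named in the abstract, a minimal sketch of how they could be computed on pixel regions follows. The abstract does not state the benchmark's success criterion, so the IoU threshold used in `success_rate` is purely an assumption for illustration, as are the function names.

```python
def iou(pred, target):
    """Intersection-over-union of two regions given as sets of (row, col) pixels."""
    pred, target = set(pred), set(target)
    union = pred | target
    if not union:
        return 0.0  # convention chosen here for two empty regions
    return len(pred & target) / len(union)


def success_rate(ious, threshold=0.5):
    """Fraction of cases whose IoU meets a threshold (hypothetical criterion)."""
    if not ious:
        return 0.0
    return sum(v >= threshold for v in ious) / len(ious)


# Toy example: two 2x2 squares that share one column of pixels.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
target = {(0, 1), (0, 2), (1, 1), (1, 2)}
score = iou(pred, target)  # 2 shared pixels out of 6 distinct pixels
```

A success criterion based on, say, whether a sampled placement point falls inside the target region would be an equally plausible reading; only the IoU definition itself is standard.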
Problem

Research questions and friction points this paper is trying to address.

Addresses the failure of traditional space-grounding methods to handle complex spatial reasoning (distance, geometry, inter-object relationships)
Overcomes coarse output issues in vision-language model localization
Refines spatial regions through superpixel-based environmental alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine framework with grid-based VLM prompting
Superpixelization refines regions to fit the local environment
Propose-validate strategy yields physically and semantically valid regions
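The coarse-to-fine idea in the list above can be illustrated with a small, self-contained sketch: rasterize an elliptical coarse region, then snap it to segment boundaries while skipping occupied space. Everything here is an assumption made for illustration. Square blocks stand in for real superpixels, and the overlap threshold and `occupied` set are hypothetical; the paper's actual refinement rule is not given in this summary.

```python
def ellipse_mask(h, w, cy, cx, ry, rx):
    """Pixels (row, col) inside an axis-aligned ellipse: the coarse region."""
    return {
        (r, c)
        for r in range(h)
        for c in range(w)
        if ((r - cy) / ry) ** 2 + ((c - cx) / rx) ** 2 <= 1.0
    }


def block_superpixels(h, w, size):
    """Toy stand-in for superpixels: square blocks, label -> set of pixels."""
    segments = {}
    for r in range(h):
        for c in range(w):
            segments.setdefault((r // size, c // size), set()).add((r, c))
    return segments


def refine(coarse, segments, occupied, min_overlap=0.5):
    """Keep whole segments that are obstacle-free and at least min_overlap inside the coarse region."""
    refined = set()
    for pixels in segments.values():
        overlap = len(pixels & coarse) / len(pixels)
        if overlap >= min_overlap and not (pixels & occupied):
            refined |= pixels
    return refined


H, W = 8, 8
coarse = ellipse_mask(H, W, cy=3.5, cx=3.5, ry=3.0, rx=3.0)
segments = block_superpixels(H, W, size=2)
occupied = {(2, 2), (2, 3), (3, 2), (3, 3)}  # hypothetical obstacle pixels
region = refine(coarse, segments, occupied)
```

In practice the segments would come from a real superpixel algorithm, so region boundaries would follow image edges rather than a fixed block grid.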