3D Aware Region Prompted Vision Language Model

šŸ“… 2025-09-16
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This work addresses three key challenges: (1) the difficulty of jointly modeling single-view 2D images and multi-view 3D data; (2) the heavy reliance of cross-frame region prompting on dense per-frame annotations; and (3) the absence of ground-truth 3D inputs in real-world videos. To this end, the paper proposes a unified 2D–3D vision-language model grounded in a shared visual token space. The method introduces learnable 3D positional encodings to enrich 2D features and integrates multi-view geometric priors via cross-modal attention, enabling alignment and joint reasoning between 2D imagery and 3D space within a common representation. Crucially, it requires no dense multi-frame annotations and accepts bounding boxes or masks on any single frame, as well as direct 3D spatial prompts. This substantially improves cross-view spatial relationship modeling and metric 3D reasoning. The model achieves state-of-the-art performance across multiple 2D vision-language and 3D spatial understanding benchmarks, and generalizes to real-world videos that lack ground-truth 3D supervision.
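
A minimal sketch of how this 3D enrichment might look in code. The module name, the MLP encoder, and the assumption that every 2D patch comes with a metric (x, y, z) coordinate (e.g., from depth and camera pose) are hypothetical reconstructions from this summary, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Lift2DTokensTo3D(nn.Module):
    """Hypothetical sketch: make 2D patch tokens 3D-aware via learned positional encodings.

    Assumes each patch token comes with an (x, y, z) coordinate in a shared
    metric scene frame (e.g., from depth and camera pose), as the summary implies.
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        # Small MLP mapping a metric 3D coordinate to a positional embedding.
        self.pos_mlp = nn.Sequential(
            nn.Linear(3, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, patch_tokens: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) 2D visual tokens from a ViT-style encoder.
        # xyz:          (B, N, 3) metric 3D coordinate of each patch.
        return patch_tokens + self.pos_mlp(xyz)  # 3D-aware tokens in the shared space

# Toy usage: 2 frames x 196 patches with random features and coordinates.
lift = Lift2DTokensTo3D(dim=768)
tokens_3d = lift(torch.randn(2, 196, 768), torch.randn(2, 196, 3))
```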

šŸ“ Abstract
We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, SR-3D applies to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where it accurately infers spatial relationships and metric measurements.
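
One plausible reading of "flexible region prompting" is masked pooling over the 3D-aware tokens of whichever frame carries the prompt. The helper below is a minimal sketch under that assumption; the function name and the mean-pooling choice are illustrative, not SR-3D's actual API:

```python
import torch

def region_token_from_prompt(frame_tokens: torch.Tensor,
                             patch_mask: torch.Tensor) -> torch.Tensor:
    """Pool the 3D-aware tokens selected by a single-frame region prompt.

    frame_tokens: (N, dim) 3D-aware patch tokens of one frame.
    patch_mask:   (N,) bool, True where a box or segmentation prompt covers a
                  patch (a 3D prompt could be rasterized to the same mask).
    Returns a single region token summarizing the prompted object.
    """
    return frame_tokens[patch_mask].mean(dim=0)  # simple average pooling

# Toy usage: a prompt covering patches 10..19 of a 196-patch frame.
frame_tokens = torch.randn(196, 768)
mask = torch.zeros(196, dtype=torch.bool)
mask[10:20] = True
region = region_token_from_prompt(frame_tokens, mask)  # shape (768,)
```

Because each pooled token already carries metric 3D position, a region annotated on one frame stays grounded in the shared scene space, which is what makes exhaustive multi-frame labeling unnecessary.
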
Problem

Research questions and friction points this paper is trying to address.

Bridging single-view 2D images and multi-view 3D data via a shared visual token space
Enabling flexible region annotation (boxes, masks, or 3D prompts on any frame) without exhaustive multi-frame labeling
Enhancing spatial reasoning across frames using 3D positional embeddings (a toy illustration follows this list)
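
To make the cross-frame point concrete: once every patch token carries a coordinate in one metric scene frame, relating two objects reduces to geometry on their 3D points, even when the objects never co-occur in a view. The sketch below is a toy illustration; the function name, centroid heuristic, and z-up convention are assumptions, not the paper's method:

```python
import torch

def metric_relation(xyz_a: torch.Tensor, xyz_b: torch.Tensor) -> dict:
    """Toy cross-frame reasoning: relate two objects through their metric 3D points.

    xyz_a, xyz_b: (K, 3) scene-frame coordinates of the patches covering each
    object, even if the two objects were prompted on different frames and
    never co-occur in a single view.
    """
    ca, cb = xyz_a.mean(dim=0), xyz_b.mean(dim=0)  # object centroids
    return {
        "distance_m": torch.linalg.norm(cb - ca).item(),  # metric distance
        "b_is_higher": bool(cb[2] > ca[2]),               # assumes z-up axes
    }

# Toy usage: two point clusters roughly 2 m apart.
obj_a = 0.1 * torch.randn(12, 3)
obj_b = 0.1 * torch.randn(15, 3) + torch.tensor([2.0, 0.0, 0.3])
print(metric_relation(obj_a, obj_b))
```
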
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable 3D positional embeddings enrich 2D visual features
Flexible region prompting via bounding boxes, segmentation masks, or direct 3D prompts
Unifies the 2D and 3D representation spaces for scene understanding