Sparse Multiview Open-Vocabulary 3D Detection

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses open-vocabulary 3D object detection from sparse multi-view RGB images, proposing a novel paradigm that requires neither 3D annotations nor end-to-end training. Methodologically, it first leverages off-the-shelf 2D foundation models to generate open-vocabulary 2D detections; then employs geometry-guided 2D-to-3D lifting to obtain initial 3D proposals; and finally refines these proposals via a cross-view feature consistency optimization—aligning multi-view features in 3D space and re-scoring proposals without additional training. The core contribution lies in decoupling 2D semantic generalization from 3D geometric reasoning, thereby eliminating reliance on dense viewpoint coverage, 3D supervision, or fixed-category training. Experiments demonstrate that the method achieves performance on par with state-of-the-art approaches under dense-view settings, while significantly outperforming existing methods under sparse-view conditions—establishing a strong, annotation-free baseline for open-vocabulary 3D detection.
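The geometry-guided 2D-to-3D lifting step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the single depth estimate for the object center, and the per-category size prior are all assumptions introduced here for clarity.

```python
import numpy as np

def lift_box_to_3d(box2d, K, cam_to_world, depth, size_prior):
    """Lift a 2D detection to an initial 3D box proposal (hypothetical helper).

    box2d: (x1, y1, x2, y2) pixel coordinates of the 2D detection.
    K: 3x3 camera intrinsics matrix.
    cam_to_world: 4x4 camera-to-world pose.
    depth: assumed depth (meters) of the object center along its ray.
    size_prior: assumed (w, h, d) extent for the detected category.
    """
    # Back-project the 2D box center along its camera ray to the given depth.
    cx = 0.5 * (box2d[0] + box2d[2])
    cy = 0.5 * (box2d[1] + box2d[3])
    ray_cam = np.linalg.inv(K) @ np.array([cx, cy, 1.0])
    center_cam = ray_cam / ray_cam[2] * depth          # point at z = depth
    center_world = (cam_to_world @ np.append(center_cam, 1.0))[:3]
    return {"center": center_world, "size": np.asarray(size_prior, dtype=float)}
```

Such a proposal is only a coarse initialization; the subsequent cross-view consistency optimization is what refines the depth and extent without any 3D supervision.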

📝 Abstract
The ability to interpret and comprehend a 3D scene is essential for many vision and robotics systems. In numerous applications, this involves 3D object detection, i.e., identifying the location and dimensions of objects belonging to a specific category, typically represented as bounding boxes. This has traditionally been solved by training to detect a fixed set of categories, which limits its use. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting, where only a limited number of posed RGB images are available as input. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion or requiring 3D-specific learning. By lifting 2D detections and directly optimizing 3D proposals for featuremetric consistency across views, we fully leverage the extensive training data available in 2D compared to 3D. Through standard benchmarks, we demonstrate that this simple pipeline establishes a powerful baseline, performing competitively with state-of-the-art techniques in densely sampled scenarios while significantly outperforming them in the sparse-view setting.
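The featuremetric-consistency idea can be illustrated with a simplified scoring sketch: project a proposal's 3D center into each view, sample a feature vector from that view's 2D foundation-model feature map, and score the proposal by the mean pairwise cosine similarity of those features. The function names, the single-point sampling, and the dictionary-based view format are assumptions made here for illustration; the paper optimizes proposals over full boxes, not just a center point.

```python
import numpy as np

def project(point_world, K, world_to_cam):
    """Project a world-space point into pixel coordinates (pinhole model)."""
    p_cam = (world_to_cam @ np.append(point_world, 1.0))[:3]
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]

def consistency_score(center_world, views):
    """Score a 3D proposal by cross-view feature agreement (sketch only).

    views: list of dicts with keys 'K' (3x3 intrinsics), 'world_to_cam'
    (4x4 pose), and 'feat_map' (H x W x C feature map, e.g. produced by
    a 2D foundation model). Samples the feature at the projected center
    in each view and returns the mean off-diagonal cosine similarity.
    """
    feats = []
    for v in views:
        px, py = project(center_world, v["K"], v["world_to_cam"])
        H, W, _ = v["feat_map"].shape
        ui = int(np.clip(round(px), 0, W - 1))   # nearest-pixel sampling
        vi = int(np.clip(round(py), 0, H - 1))
        f = v["feat_map"][vi, ui]
        feats.append(f / (np.linalg.norm(f) + 1e-8))
    feats = np.stack(feats)
    sim = feats @ feats.T                        # pairwise cosine similarities
    n = len(feats)
    return (sim.sum() - n) / (n * (n - 1))       # mean excluding self-pairs
```

A correctly placed proposal projects onto the same object in every view, so its sampled features agree and the score is high; a misplaced proposal lands on different content per view and scores low, which is what lets the method re-rank proposals without any training.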
Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary 3D object detection
Sparse multiview input images
Training-free approach using 2D models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free open-vocabulary 3D detection
Lifting 2D detections to 3D proposals
Featuremetric consistency optimization across views