VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of multi-view indoor 3D object detection without access to sensor-derived geometric information such as camera poses or depth, and presents the first detection framework designed specifically for this setting. The proposed method builds on the Visual Geometry Grounded Transformer (VGGT) and introduces two key mechanisms, attention-guided query generation (AG) and query-driven feature aggregation (QD), which exploit the semantic and geometric priors implicit in the VGGT architecture to enable robust 3D perception. Evaluated on ScanNet and ARKitScenes, the approach outperforms the prior best method by 4.4 and 8.6 mAP@0.25, respectively, demonstrating strong detection performance under geometry-free conditions.

📝 Abstract
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based detection pipeline. To leverage both the semantic and geometric priors inside VGGT, we introduce two key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to 'see' what they need and then dynamically aggregates multi-level geometric features across the VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det surpasses the best-performing prior method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation studies confirm that VGGT's internally learned semantic and geometric priors are effectively leveraged by AG and QD.
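To make the two components concrete, here is a minimal NumPy sketch of how the abstract's AG and QD ideas could look in code. This is an illustrative reading, not the authors' implementation: all function and variable names (`attention_guided_queries`, `query_driven_aggregation`, `see_query`, the layer-pooling and top-k choices) are assumptions, and the real method operates inside a full transformer detection pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_guided_queries(tokens, attn_scores, num_queries):
    """AG sketch: initialize object queries from the patch tokens that
    receive the most attention mass, so queries start on object regions.
    tokens: (N, D) patch features; attn_scores: (N,) per-token attention."""
    top_idx = np.argsort(attn_scores)[::-1][:num_queries]
    return tokens[top_idx]

def query_driven_aggregation(layer_feats, see_query):
    """QD sketch: a learnable See-Query scores each VGGT layer (here via a
    dot product with mean-pooled layer features) and mixes the multi-level
    features with softmax weights.
    layer_feats: (L, N, D) features from L VGGT layers; see_query: (D,)."""
    layer_summary = layer_feats.mean(axis=1)                  # (L, D)
    scores = layer_summary @ see_query / np.sqrt(layer_feats.shape[-1])
    weights = softmax(scores)                                 # (L,)
    return np.einsum('l,lnd->nd', weights, layer_feats)       # (N, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))      # 14x14 patch grid, D=64
attn = rng.random(196)
queries = attention_guided_queries(tokens, attn, num_queries=32)

layer_feats = rng.standard_normal((4, 196, 64))  # 4 VGGT layers
see_query = rng.standard_normal(64)
fused = query_driven_aggregation(layer_feats, see_query)
print(queries.shape, fused.shape)  # (32, 64) (196, 64)
```

The sketch captures the stated division of labor: AG turns attention maps into query initializations, while QD lets a query-side signal decide how much each layer's progressively 3D-lifted features contribute to the fused representation.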
Problem

Research questions and friction points this paper is trying to address.

multi-view
indoor 3D object detection
sensor-geometry-free
camera pose
3D detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sensor-Geometry-Free
VGGT-Det
Attention-Guided Query Generation
Query-Driven Feature Aggregation
multi-view 3D object detection