BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary 3D detection methods rely on dense point cloud reconstruction, incurring prohibitive computational and memory overhead and hindering real-time deployment. This paper proposes the first reconstruction-free online open-vocabulary 3D detection framework that directly processes streaming RGB-D video: it extracts per-frame 3D box proposals and attaches CLIP-based semantic embeddings, associates detections across frames via 3D non-maximum suppression and multi-view matching, and refines the fused boxes with an IoU-guided particle filter. The core innovations are (i) a "reconstruction-free paradigm" that eliminates explicit 3D geometry reconstruction, and (ii) an efficient stochastic optimization mechanism for fusion that enforces multi-view consistency while drastically improving runtime efficiency. Evaluated on ScanNetV2 and CA-1M, the method achieves state-of-the-art performance among online approaches, enabling edge-deployable real-time detection (>15 FPS) over large-scale scenes exceeding 1000 m².
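The inter-frame association step relies on 3D non-maximum suppression over box proposals. The paper does not publish its exact implementation; the following is a minimal sketch of greedy 3D NMS over axis-aligned boxes (the box encoding `(x1, y1, z1, x2, y2, z2)`, the score threshold, and the helper names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes encoded as (x1, y1, z1, x2, y2, z2)."""
    lo = np.maximum(a[:3], b[:3])          # lower corner of the intersection
    hi = np.minimum(a[3:], b[3:])          # upper corner of the intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def nms_3d(boxes, scores, thresh=0.25):
    """Greedy 3D NMS: keep the highest-scoring box, drop boxes overlapping it,
    and repeat on the remainder. Returns indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        order = np.array(
            [j for j in order[1:] if iou_3d(boxes[i], boxes[j]) < thresh],
            dtype=int,
        )
    return keep
```

With two heavily overlapping boxes and one distant box, the lower-scoring duplicate is suppressed and the distant box survives, which is the behavior the association module needs before multi-view matching.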

📝 Abstract
Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational and memory overhead and hinders real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient, real-time 3D detection. Specifically, given streaming posed RGB-D video, we leverage Cubify Anything, a pre-trained visual foundation model (VFM), to predict 3D bounding boxes from single views, coupled with CLIP to capture the open-vocabulary semantics of detected objects. To fuse the boxes detected across different views into a unified set, we employ an association module that establishes multi-view correspondences and an optimization module that merges the 3D bounding boxes of the same instance predicted from multiple views. The association module combines 3D Non-Maximum Suppression (NMS) with a box correspondence matching module, while the optimization module applies an IoU-guided, efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on the ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this reconstruction-free paradigm, our method generalizes well across diverse scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
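The optimization module's "IoU-guided efficient random optimization based on particle filtering" can be sketched as a sample-score-resample loop: perturb candidate fused boxes, score each by mean IoU against the per-view observations, and resample around the best candidates with shrinking noise. Everything below (the axis-aligned box encoding, particle counts, noise schedule, and function names) is an illustrative assumption under that reading, not the paper's implementation:

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU for boxes encoded as (x1, y1, z1, x2, y2, z2)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    return inter / (np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter)

def fuse_boxes_iou_guided(view_boxes, n_particles=256, n_iters=10,
                          sigma=0.05, seed=0):
    """Particle-filter-style fusion sketch: search for the single box
    maximizing mean IoU with the boxes observed from multiple views."""
    rng = np.random.default_rng(seed)
    view_boxes = np.asarray(view_boxes, dtype=float)

    def score(p):
        return np.mean([iou_3d(p, b) for b in view_boxes])

    # Initialize particles around the mean of the per-view boxes.
    mean_box = view_boxes.mean(axis=0)
    particles = mean_box + rng.normal(0.0, sigma, size=(n_particles, 6))
    for _ in range(n_iters):
        scores = np.array([score(p) for p in particles])
        # Resample around the top quarter and shrink the noise (annealing).
        top = particles[np.argsort(scores)[-n_particles // 4:]]
        idx = rng.integers(0, len(top), size=n_particles)
        particles = top[idx] + rng.normal(0.0, sigma, size=(n_particles, 6))
        sigma *= 0.7
    scores = np.array([score(p) for p in particles])
    return particles[np.argmax(scores)]
```

The appeal of this style of stochastic search is that it needs only IoU evaluations, no gradients, so the per-box fusion cost stays small and bounded regardless of scene size.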
Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary 3D object detection without dense reconstruction
Real-time multi-view box fusion for memory efficiency
Minimizing computational overhead in autonomous systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstruction-free online 3D detection framework
Multi-view box fusion with association and optimization
Real-time open-vocabulary 3D object detection