Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

📅 2025-07-29
🤖 AI Summary
This work addresses open-vocabulary semantic 3D reconstruction from RGB video, a core challenge in spatial AI, by proposing Ov3R, a framework that embeds CLIP semantics directly into the 3D reconstruction process. Ov3R has two components: CLIP3R, a CLIP-informed reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics, and 2D-3D OVS, an open-vocabulary semantic module that lifts 2D features into 3D through learned descriptors fusing spatial, geometric, and semantic cues. Together they target globally consistent geometry and fine-grained semantic alignment. The authors report state-of-the-art performance on both dense 3D reconstruction and open-vocabulary 3D semantic segmentation, and present the framework as a step toward real-time, semantics-aware spatial AI.

📝 Abstract
We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
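
The open-vocabulary segmentation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each 3D point already carries a feature vector in the same embedding space as a set of text-prompt embeddings, and it labels each point with the prompt of highest cosine similarity. The function name `assign_open_vocab_labels`, the 16-dimensional toy vectors, and the random stand-ins for CLIP embeddings are all hypothetical; this is not the paper's implementation.

```python
import numpy as np

def assign_open_vocab_labels(point_feats, text_feats):
    """Label each point with the index of its most similar text prompt.

    point_feats: (N, C) per-point features (assumed to live in the text embedding space)
    text_feats:  (K, C) one embedding per text prompt
    """
    # L2-normalise both sets so the dot product equals cosine similarity.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = p @ t.T  # (N, K) cosine similarities
    return sim.argmax(axis=1), sim

# Toy data: random vectors stand in for real text and point embeddings.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 16))          # three hypothetical prompts
true_ids = np.array([2, 0, 1, 1])              # prompt each toy point was built from
point_feats = text_feats[true_ids] + 0.01 * rng.normal(size=(4, 16))

labels, sim = assign_open_vocab_labels(point_feats, text_feats)
print(labels)
```

Because the label set is just a list of text embeddings, new categories can be queried at inference time without retraining, which is what makes the segmentation open-vocabulary.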

Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary semantic 3D reconstruction from RGB videos
Integrating CLIP semantics into 3D reconstruction process
Achieving globally consistent geometry and semantic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-informed 3D reconstruction module
2D-3D open-vocabulary semantic module
Integrates CLIP semantics into reconstruction
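
To make the idea of lifting 2D features into 3D concrete, here is a simplified, non-learned stand-in. The paper describes learned fused descriptors; this sketch only pairs each pixel's feature with its 3D location from a dense point map and averages features that fall into the same voxel. The function names, the `voxel_size` parameter, and the averaging rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

def lift_features_to_3d(point_map, feat_map):
    """Pair every pixel's 3D point with that pixel's 2D feature.

    point_map: (H, W, 3) per-pixel 3D coordinates (a dense point map)
    feat_map:  (H, W, C) per-pixel 2D features
    """
    h, w, _ = point_map.shape
    return point_map.reshape(h * w, 3), feat_map.reshape(h * w, -1)

def fuse_by_voxel(points, feats, voxel_size=0.05):
    """Average the features of all points that land in the same voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep a flat index across NumPy versions
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inverse, feats)            # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    centers = (uniq + 0.5) * voxel_size        # voxel centre coordinates
    return centers, sums / counts[:, None]

# Toy 1x3 "image": two pixels map to nearby 3D points, one maps far away.
point_map = np.array([[[0.01, 0.01, 0.01], [0.02, 0.02, 0.02], [1.01, 1.01, 1.01]]])
feat_map = np.array([[[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]])

pts, fts = lift_features_to_3d(point_map, feat_map)
centers, fused = fuse_by_voxel(pts, fts)
print(fused)
```

Running the same fusion over points from overlapping frames would merge repeated observations of a surface into one descriptor per voxel; a learned module could replace the plain average.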