Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

📅 2025-07-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses open-vocabulary semantic 3D reconstruction from RGB video, a core challenge in spatial AI, by proposing Ov3R, a framework that embeds CLIP semantics directly into the 3D reconstruction process. Ov3R has two components. CLIP3R is a CLIP-informed reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics. 2D-3D OVS is an open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors that combine spatial, geometric, and semantic cues. Together they target globally consistent geometry and fine-grained semantic alignment. The authors report state-of-the-art performance on dense 3D reconstruction and open-vocabulary 3D semantic segmentation, and present the framework as a step toward real-time, semantics-aware Spatial AI.

📝 Abstract

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
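
The abstract says CLIP3R works on overlapping clips but does not give a clip length or overlap. As a minimal sketch of the idea only, the snippet below splits a video's frame indices into fixed-length clips that share a few frames with their neighbours. The function name `overlapping_clips` and the parameters `clip_len` and `overlap` are assumptions made for illustration, not values or names from the paper.

```python
def overlapping_clips(num_frames, clip_len=8, overlap=2):
    """Split frame indices 0..num_frames-1 into clips sharing `overlap` frames.

    `clip_len` and `overlap` are illustrative defaults, not values from Ov3R.
    """
    if clip_len <= 0 or not 0 <= overlap < clip_len:
        raise ValueError("need clip_len > 0 and 0 <= overlap < clip_len")

    stride = clip_len - overlap  # how far each new clip advances
    clips = []
    start = 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        clips.append(list(range(start, end)))
        if end == num_frames:  # the last frame is covered, stop here
            break
        start += stride
    return clips


if __name__ == "__main__":
    for clip in overlapping_clips(10, clip_len=4, overlap=1):
        print(clip)
```

Frames shared between neighbouring clips give a reconstruction pipeline common observations with which to bring per-clip predictions into one coordinate frame. How Ov3R actually does this is not described in the abstract.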

Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary semantic 3D reconstruction from RGB videos

Integrating CLIP semantics into the 3D reconstruction process

Achieving globally consistent geometry and semantic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-informed 3D reconstruction module

2D-3D open-vocabulary semantic module

Integrates CLIP semantics into reconstruction
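
The summary above does not say how the open-vocabulary module turns descriptors into labels. A common pattern in CLIP-based segmentation, used here purely as an assumption, is to compare each 3D point's feature vector with the text embedding of every candidate label and keep the most similar one. The sketch below shows that step with toy 3-dimensional vectors standing in for real CLIP embeddings. The names `cosine` and `assign_labels` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_labels(point_feats, text_feats):
    """Give each point the label whose text embedding is most similar.

    point_feats: list of per-point feature vectors (assumed to live in the
                 same embedding space as the text features).
    text_feats:  dict mapping a label string to its embedding vector.
    """
    labels = []
    for feat in point_feats:
        best = max(text_feats, key=lambda name: cosine(feat, text_feats[name]))
        labels.append(best)
    return labels


if __name__ == "__main__":
    # Toy embeddings, not real CLIP outputs.
    text_feats = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
    point_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
    print(assign_labels(point_feats, text_feats))
```

Because the label set is just a dictionary of text embeddings, new categories can be added at query time without retraining, which is what makes this kind of segmentation open-vocabulary.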
