🤖 AI Summary
This paper targets two challenges in coupling 3D reconstruction with open-vocabulary semantic understanding from pose-free multi-view images: poor generalization and decoupled modeling. It proposes the first feed-forward unified framework for the task, employing a cross-view Transformer for robust feature alignment and semantics-enhanced 3D Gaussian splatting to build a generalizable scene representation. The framework jointly optimizes novel view synthesis, 3D semantic segmentation, and depth prediction end to end, without per-scene fine-tuning. Crucially, it embeds open-vocabulary semantic alignment directly into the 3D Gaussian primitives, enabling tight geometric-semantic co-modeling. On RE10K and ScanNet, the method achieves state-of-the-art results of 25.07 PSNR and 55.84 mIoU, respectively, surpassing existing approaches across multiple metrics.
📝 Abstract
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation enables high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state of the art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work points toward a new paradigm of generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
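The key idea above is that each 3D Gaussian primitive carries a semantic feature alongside its color and opacity, so a single alpha-compositing pass can render an RGB image, a semantic feature map, and depth at once. The following is a minimal pure-Python sketch of that per-pixel compositing and of open-vocabulary labeling by feature-to-text similarity. It is not the authors' implementation: the dictionary fields, the toy 8-dimensional feature size, and the random "text" embeddings are illustrative assumptions (Uni3R aligns rendered features with a vision-language embedding space such as CLIP's).

```python
import math
import random

def composite(gaussians):
    """Front-to-back alpha-composite the depth-sorted Gaussians covering
    one pixel, accumulating RGB, a semantic feature, and expected depth."""
    gaussians = sorted(gaussians, key=lambda g: g["depth"])  # near to far
    T = 1.0                      # accumulated transmittance
    rgb = [0.0, 0.0, 0.0]
    feat = [0.0] * 8             # toy semantic feature dimension (assumption)
    depth = 0.0
    for g in gaussians:
        w = T * g["alpha"]       # this primitive's contribution weight
        rgb = [r + w * c for r, c in zip(rgb, g["rgb"])]
        feat = [a + w * b for a, b in zip(feat, g["feat"])]
        depth += w * g["depth"]
        T *= 1.0 - g["alpha"]
    return rgb, feat, depth

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a)) + 1e-8
    nb = math.sqrt(sum(x * x for x in b)) + 1e-8
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def open_vocab_label(feat, text_embeds):
    """Pick the text class whose embedding best matches the rendered
    feature -- the open-vocabulary segmentation step, in miniature."""
    return max(range(len(text_embeds)), key=lambda i: cosine(feat, text_embeds[i]))

random.seed(0)
# Four toy Gaussians along one camera ray (all values illustrative).
prims = [{"depth": random.uniform(1.0, 5.0), "alpha": 0.6,
          "rgb": [random.random() for _ in range(3)],
          "feat": [random.random() for _ in range(8)]} for _ in range(4)]
rgb, feat, depth = composite(prims)
texts = [[random.random() for _ in range(8)] for _ in range(3)]  # 3 toy classes
label = open_vocab_label(feat, texts)
```

Because geometry (depth), appearance (rgb), and semantics (feat) share the same primitives and the same compositing weights, supervising any of the three outputs shapes the others, which is the tight geometric-semantic co-modeling the abstract describes.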