UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images

📅 2026-03-18

🤖 AI Summary

Existing feed-forward methods for 3D reconstruction from sparse, unposed images suffer from unstable geometry, poor depth quality, and limited open-vocabulary 3D semantic generalization. To address these challenges, this work proposes UniSem, a unified framework built upon 3D Gaussian Splatting (3DGS). UniSem introduces Error-aware Gaussian Dropout (EGD), which uses rendering error cues to suppress redundancy-prone Gaussians and improve depth accuracy, and a Mix-training Curriculum (MTC), which progressively blends 2D segmenter-lifted semantics with the model's own 3D semantic priors through object-level prototype alignment to improve semantic generalization. On ScanNet and Replica, with 16 input views, UniSem reduces the relative depth error (Rel) by 15.2% and improves open-vocabulary 3D segmentation mean accuracy (mAcc) by 3.7% over strong baselines.
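For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how they are commonly computed. It assumes that the relative depth error is the standard absolute relative error and that mean accuracy is the average of per-class accuracies; the function names and toy inputs are illustrative and are not taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid ground truth.

    Assumption: "depth Rel" means mean(|pred - gt| / gt); pixels with
    non-positive ground-truth depth are treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def mean_class_accuracy(pred_labels, gt_labels):
    """Average of per-class accuracies over classes present in the ground truth."""
    correct, total = {}, {}
    for p, g in zip(pred_labels, gt_labels):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + (1 if p == g else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy inputs (hypothetical values, for illustration only).
depth_error = abs_rel([1.1, 2.0, 3.0], [1.0, 2.0, 4.0])
macc = mean_class_accuracy(["a", "a", "b", "b"], ["a", "a", "a", "b"])
```

Because mean accuracy averages over classes rather than points, a rare class counts as much as a frequent one, which is why it is often used to assess semantic completeness.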

📝 Abstract

Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides only weak 3D-level supervision with limited generalizability, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
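The two components described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how rendering error is turned into a per-Gaussian score, what schedule the curriculum follows, or how prototype alignment is computed. The names `redundancy_scores`, `max_drop`, `warmup_frac`, `loss_2d`, and `loss_3d` are all hypothetical, and the linear ramp is an assumed schedule chosen for simplicity.

```python
import random


def error_aware_dropout(redundancy_scores, max_drop=0.5, rng=None):
    """Return a keep-mask over Gaussians.

    Assumption: each Gaussian has a non-negative score derived from rendering
    error cues, where a higher score means "more redundancy-prone". The drop
    probability grows linearly with the normalized score, capped at max_drop.
    """
    rng = rng or random.Random(0)
    top = max(redundancy_scores)
    if top <= 0:
        return [True] * len(redundancy_scores)  # nothing flagged, keep all
    return [rng.random() >= max_drop * (s / top) for s in redundancy_scores]


def mix_weight(step, total_steps, warmup_frac=0.3):
    """Curriculum weight for the 3D term: 0 during warm-up, then a linear ramp to 1."""
    start = warmup_frac * total_steps
    if step <= start:
        return 0.0
    if step >= total_steps:
        return 1.0
    return (step - start) / (total_steps - start)


def mixed_semantic_loss(loss_2d, loss_3d, step, total_steps):
    """Blend a 2D segmenter-lifted loss with a 3D prototype-alignment loss.

    Both loss values are placeholders supplied by the caller; how they are
    computed is outside the scope of this sketch.
    """
    w = mix_weight(step, total_steps)
    return (1.0 - w) * loss_2d + w * loss_3d


# With max_drop=1.0 the highest-scoring Gaussian is always dropped and a
# zero-score Gaussian is always kept.
mask = error_aware_dropout([0.0, 1.0], max_drop=1.0)
blended = mixed_semantic_loss(2.0, 4.0, step=65, total_steps=100)
```

The sketch keeps the dropout stochastic rather than applying a hard threshold, so that no Gaussian is removed deterministically unless `max_drop` is set to 1.0; whether UniSem uses a stochastic or deterministic rule is not stated in the abstract.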

Problem

Research questions and friction points this paper is trying to address.

semantic 3D reconstruction

sparse unposed images

3D Gaussian Splatting

depth accuracy

semantic generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting

semantic 3D reconstruction

error-aware dropout