🤖 AI Summary
This work addresses the challenge of open-vocabulary 3D scene understanding from unconstrained, Internet-sourced architectural photographs. To this end, we propose a unified framework integrating language embeddings with 3D Gaussian splatting (3DGS) rendering. Our method introduces a transient uncertainty-aware autoencoder to jointly model multi-appearance CLIP features and a language field, incorporates a feature-uncertainty-guided optimization strategy, and employs a post-ensemble strategy to strengthen language-feature compression and cross-modal alignment. Evaluated on our newly constructed PT-OVS benchmark, the approach enables open-vocabulary 3D segmentation, interactive 3D roaming, architectural style recognition, and scene editing. It advances zero-shot 3D semantic understanding toward scalable, interactive architectural knowledge modeling without requiring task-specific annotations or fine-tuning.
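As a rough illustration of the compression step, the sketch below (PyTorch) shows what a transient uncertainty-aware autoencoder over per-pixel CLIP features might look like. The class name `UncertaintyAwareAutoencoder`, the layer widths, the latent size, and the uncertainty weighting are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a transient uncertainty-aware autoencoder that compresses
# per-pixel CLIP features into a low-dimensional code. Architecture details
# (layer widths, latent size, weighting) are assumptions for illustration.
import torch
import torch.nn as nn

class UncertaintyAwareAutoencoder(nn.Module):
    def __init__(self, feat_dim: int = 512, latent_dim: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, feats: torch.Tensor):
        z = self.encoder(feats)          # compressed language feature
        return z, self.decoder(z)        # code and reconstruction

def reconstruction_loss(recon, target, transient_uncertainty):
    # Down-weight pixels flagged as transient (e.g., pedestrians, vehicles)
    # so they contribute less to the language-feature reconstruction.
    weight = 1.0 / (1.0 + transient_uncertainty)   # hypothetical weighting
    per_pixel = (recon - target).abs().mean(dim=-1)
    return (weight * per_pixel).mean()
```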
📝 Abstract
Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been paid to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. This raises the question: can we draw inspiration from in-the-wild 3D reconstruction techniques and use unconstrained photo collections to create an immersive way to understand the 3D structure of architectural components? To this end, we extend language-embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first use the reconstructed radiance field to render multiple appearance images from the same viewpoint as each unconstrained photo, then extract multi-appearance CLIP features together with two types of language-feature uncertainty maps (transient uncertainty and appearance uncertainty) derived from those features to guide the subsequent optimization. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language-field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing approaches, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.
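For intuition, here is a minimal sketch of how the two language-feature uncertainty maps could be derived from multi-appearance CLIP feature maps, assuming K appearance renderings of shape (K, H, W, C) from the same viewpoint plus the feature map of the unconstrained photo itself. The variance and cosine-distance formulations, and the function name `uncertainty_maps`, are illustrative assumptions rather than the paper's exact definitions.

```python
# Sketch of transient and appearance uncertainty maps computed from
# multi-appearance CLIP feature maps. Formulations are assumptions.
import torch
import torch.nn.functional as F

def uncertainty_maps(multi_app_feats: torch.Tensor,
                     photo_feats: torch.Tensor):
    """multi_app_feats: (K, H, W, C); photo_feats: (H, W, C)."""
    f = F.normalize(multi_app_feats, dim=-1)
    mean_f = F.normalize(f.mean(dim=0), dim=-1)            # (H, W, C)

    # Appearance uncertainty: how much the feature varies across the K
    # rendered appearances (lighting, weather, exposure) at each pixel.
    appearance_u = (f - mean_f).pow(2).sum(dim=-1).mean(dim=0)  # (H, W)

    # Transient uncertainty: disagreement between the real photo's features
    # and the consensus of the rendered appearances; large values suggest
    # occluders absent from the reconstructed static scene.
    photo_f = F.normalize(photo_feats, dim=-1)
    transient_u = 1.0 - (photo_f * mean_f).sum(dim=-1)          # (H, W)
    return appearance_u, transient_u
```

In the pipeline described above, such maps guide the subsequent optimization, for example by down-weighting pixels with high transient uncertainty when supervising the language field.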