🤖 AI Summary
Existing methods struggle to learn fine-grained, language-aware 3D representations from 2D images, hindering open-vocabulary 3D scene understanding. To address this, the authors propose GALA, a language-aligned framework built on 3D Gaussian Splatting. Its core contribution is a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings, eliminating per-Gaussian high-dimensional feature storage and substantially reducing memory overhead. The framework further distills a scene-specific 3D instance feature field via self-supervised contrastive learning, yielding a unified query space for both 2D and 3D open-vocabulary queries. Evaluated on real-world scene datasets, GALA delivers strong open-vocabulary recognition with cross-view semantic consistency, advancing cross-modal alignment between language and 3D geometry.
📝 Abstract
3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It also reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D tasks.
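To make the memory argument concrete, the codebook idea can be sketched as follows: instead of storing a high-dimensional language feature per Gaussian, each Gaussian keeps only a compact feature that cross-attends over a small codebook of key/value entries to produce its language-space embedding. This is a minimal NumPy sketch under assumed shapes and a single-head attention formulation; all names (`codebook_cross_attention`, the key/value split of the two codebooks, and the dimensions) are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def codebook_cross_attention(gauss_feats, codebook_keys, codebook_values):
    """Map compact per-Gaussian features to language-space embeddings by
    attending over a small codebook (hypothetical single-head variant)."""
    d = codebook_keys.shape[-1]
    # (N, K) attention weights: each Gaussian attends over K codebook entries.
    attn = softmax(gauss_feats @ codebook_keys.T / np.sqrt(d), axis=-1)
    # (N, d_lang) view-independent semantic embeddings.
    return attn @ codebook_values

rng = np.random.default_rng(0)
N, d_low, K, d_lang = 1000, 16, 64, 512   # assumed sizes for illustration
feats = rng.normal(size=(N, d_low))       # compact per-Gaussian features
keys = rng.normal(size=(K, d_low))        # codebook 1: keys (learnable in practice)
values = rng.normal(size=(K, d_lang))     # codebook 2: values in language space
emb = codebook_cross_attention(feats, keys, values)
print(emb.shape)  # → (1000, 512)
```

Under these assumed sizes, per-Gaussian storage drops from N × d_lang floats to N × d_low plus a shared K × (d_low + d_lang) codebook, which is where the memory saving in the abstract would come from.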