🤖 AI Summary
Existing methods struggle to learn fine-grained, language-aware 3D representations from 2D images, hindering open-vocabulary 3D scene understanding. To address this, the authors propose GALA, a language-aligned framework built on 3D Gaussian Splatting. Its core contribution is a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings, eliminating per-Gaussian high-dimensional feature storage and substantially reducing memory overhead. The framework further distills a scene-specific 3D instance feature field via self-supervised contrastive learning, yielding a unified query space for both 2D and 3D open-vocabulary queries. Evaluated on real-world scene datasets, GALA delivers strong open-vocabulary recognition with cross-view semantic consistency, advancing cross-modal alignment between language and 3D geometry.
📝 Abstract
3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA: a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It also reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D tasks.
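To make the memory argument concrete, the codebook idea can be sketched as follows: instead of storing a high-dimensional language feature per Gaussian, each Gaussian keeps only a compact feature that cross-attends over a small codebook of key/value entries to produce its language-space embedding. This is a minimal NumPy sketch under assumed shapes and a single-head attention formulation; all names (`codebook_cross_attention`, the key/value split of the two codebooks, and the dimensions) are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def codebook_cross_attention(gauss_feats, codebook_keys, codebook_values):
    """Map compact per-Gaussian features to language-space embeddings by
    attending over a small codebook (hypothetical single-head variant)."""
    d = codebook_keys.shape[-1]
    # (N, K) attention weights: each Gaussian attends over K codebook entries.
    attn = softmax(gauss_feats @ codebook_keys.T / np.sqrt(d), axis=-1)
    # (N, d_lang) view-independent semantic embeddings.
    return attn @ codebook_values

rng = np.random.default_rng(0)
N, d_low, K, d_lang = 1000, 16, 64, 512   # assumed sizes for illustration
feats = rng.normal(size=(N, d_low))       # compact per-Gaussian features
keys = rng.normal(size=(K, d_low))        # codebook 1: keys (learnable in practice)
values = rng.normal(size=(K, d_lang))     # codebook 2: values in language space
emb = codebook_cross_attention(feats, keys, values)
print(emb.shape)  # → (1000, 512)
```

Under these assumed sizes, per-Gaussian storage drops from N × d_lang floats to N × d_low plus a shared K × (d_low + d_lang) codebook, which is where the memory saving in the abstract would come from.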