Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently associating high-dimensional vision-language features with millions of 3D Gaussians to enable open-vocabulary 3D scene understanding. The proposed method, SCOUP, decouples language representation learning from 3D Gaussian optimization: it first learns a sparse codebook from features of 2D image regions, then lifts the resulting sparse coefficients into 3D through weighted sparse aggregation across views followed by per-Gaussian Top-K selection. The approach combines fast semantic reconstruction, a low memory footprint, and fast rendering, where existing methods rely either on dense per-Gaussian storage or on expensive per-scene optimization. Experiments report up to a 400× training speedup and 3× better memory efficiency during training relative to the state-of-the-art method in rendering speed, along with competitive or superior open-vocabulary query accuracy across multiple benchmarks.

📝 Abstract

3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely from features of 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians through weighted sparse aggregation over Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then keeps the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Compared with the state-of-the-art method in rendering speed, our method achieves up to a $400\times$ training speedup and is $3\times$ more memory efficient during training. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
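
The uplifting step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `uplift_sparse_codes`, the `(gaussian_id, region_id, weight)` association format, the normalization by total weight, and the dense accumulator are all assumptions made for illustration; the abstract only states that each Gaussian accumulates weighted coefficients over codebook atoms across views and then keeps the Top-K.

```python
import numpy as np

def uplift_sparse_codes(num_gaussians, num_atoms, views, k):
    """Hypothetical sketch of weighted aggregation followed by Top-K filtering.

    views: list of (region_codes, assoc) pairs, one per view, where
      region_codes is an (R, num_atoms) array of codebook coefficients
        for the R image regions of that view (mostly zeros), and
      assoc is a list of (gaussian_id, region_id, weight) tuples
        (assumed format for the Gaussian-to-pixel associations).
    Returns (top_idx, top_val), each of shape (num_gaussians, k).
    """
    # Dense accumulator for clarity only; a real system would keep this sparse.
    acc = np.zeros((num_gaussians, num_atoms))
    weight_sum = np.zeros(num_gaussians)

    for region_codes, assoc in views:
        for g, r, w in assoc:
            acc[g] += w * region_codes[r]   # weighted accumulation over atoms
            weight_sum[g] += w

    # Assumption: normalize by the total weight each Gaussian received.
    seen = weight_sum > 0
    acc[seen] /= weight_sum[seen, None]

    # Keep the K largest coefficients per Gaussian (atom index and value).
    top_idx = np.argsort(-acc, axis=1, kind="stable")[:, :k]
    top_val = np.take_along_axis(acc, top_idx, axis=1)
    return top_idx, top_val

# Toy example: 2 Gaussians, 4 codebook atoms, 2 views.
view_a = (np.array([[1.0, 0.0, 0.0, 0.5],
                    [0.0, 2.0, 0.0, 0.0]]),
          [(0, 0, 1.0), (1, 1, 0.5)])
view_b = (np.array([[0.5, 0.0, 0.0, 0.0]]),
          [(0, 0, 1.0), (1, 0, 0.5)])

idx, val = uplift_sparse_codes(2, 4, [view_a, view_b], k=2)
print(idx)
print(val)
```

Storing only `k` atom indices and values per Gaussian, rather than a full high-dimensional feature, is what keeps the per-Gaussian footprint small in this sketch.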

Problem

Research questions and friction points this paper is trying to address.

3D Language Gaussian Splatting

vision-language embeddings

efficient storage

fast rendering

open-vocabulary 3D scene understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Code Uplifting

3D Language Gaussian Splatting

Open-vocabulary 3D Understanding