LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses 3D scene reconstruction from sparse, unposed multi-view images by proposing an end-to-end feed-forward framework that, for the first time, achieves language-aligned 3D semantic reconstruction without requiring camera poses or iterative optimization. Built on a 3D Gaussian Splatting representation, the method introduces a sparse semantic encoding scheme that preserves high-level linguistic information while reducing representation complexity: each primitive's semantics are expressed through a global semantic dictionary combined with locally varying per-primitive weights. Experiments on a semantically augmented RealEstate10K dataset show that the approach outperforms previous methods in both novel view synthesis quality and semantic consistency, while keeping inference latency low.

📝 Abstract

We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enrich the RealEstate10K dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. A demo is available at https://liylo.github.io/langflash.github.io/.
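The sparse semantic encoding idea (a shared global dictionary plus per-primitive weights) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the function names `decode_semantics` and `relevance`, the fixed number of active atoms per primitive, the softmax-normalized weights, and all sizes below are assumptions chosen only for illustration.

```python
import numpy as np


def decode_semantics(dictionary, indices, weights):
    """Reconstruct dense per-primitive features from sparse codes.

    dictionary: (K, D) global semantic dictionary shared by all primitives.
    indices:    (N, S) integer ids of the S atoms each primitive uses (assumed).
    weights:    (N, S) per-primitive mixing weights for those atoms.
    Returns an (N, D) array: feature_n = sum_s weights[n, s] * dictionary[indices[n, s]].
    """
    atoms = dictionary[indices]                      # (N, S, D) gathered atoms
    return np.einsum("ns,nsd->nd", weights, atoms)   # weighted sum over S


def relevance(features, query):
    """Cosine similarity between each primitive feature and one query embedding."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    return f @ q                                     # (N,)


# Hypothetical sizes: N Gaussians, K dictionary atoms, D feature dims, S active atoms.
N, K, D, S = 1000, 64, 512, 4
rng = np.random.default_rng(0)

dictionary = rng.normal(size=(K, D))
indices = rng.integers(0, K, size=(N, S))
logits = rng.normal(size=(N, S))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax (assumed)

features = decode_semantics(dictionary, indices, weights)   # (N, D)
scores = relevance(features, rng.normal(size=D))             # (N,) per-Gaussian relevance

# Storage comparison in stored numbers: dense features vs. dictionary + sparse codes.
dense_count = N * D
sparse_count = K * D + 2 * N * S   # dictionary + (index, weight) pairs
print(dense_count, sparse_count)
```

Under these assumptions, each primitive stores only a handful of index and weight pairs instead of a full language feature vector, which is the kind of reduction in representation complexity the abstract describes; in practice the query would be a text embedding from the same language-aligned feature space.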

Problem

Research questions and friction points this paper is trying to address.

3D reconstruction

language-aligned semantics

sparse unposed images

Gaussian Splatting

multimodal scene understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward 3D reconstruction

Language-aligned semantics

Gaussian Splatting