FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the reconstruction of language-embedded 3D Gaussian representations from arbitrary uncalibrated, unposed multi-view images, whether sparse or dense, without 3D annotations. It proposes a 3D-annotation-free training framework for 2D-to-3D lifting with two main components: (i) instance-guided contrastive learning, which aligns 2D semantics with the 3D representation, and (ii) a geometry-semantic hierarchical sparsification strategy, which reduces the memory and computational cost of dense views. Because no 3D annotations are required, training can draw on large-scale video data with easily obtained 2D instance information. The network jointly produces geometry, appearance, and language-aligned semantics in a feed-forward manner, and the authors report that it outperforms existing methods on various related tasks.

📝 Abstract
We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. FLEG efficiently reconstructs a language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.
Problem

Research questions and friction points this paper is trying to address.

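Reconstructing language-embedded 3D Gaussians from any views, where prior feed-forward solutions are limited to fixed input views
Training without 3D annotations from unposed multi-view images, given insufficient 3D training data
Aligning 2D semantics with 3D representations via contrastive learning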
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward network reconstructs language-embedded 3D Gaussians
3D-annotation-free training from uncalibrated multi-view images
Instance-guided contrastive learning aligns 2D semantics with 3D
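
The page does not give the exact loss used by FLEG, so the following is only a minimal sketch of what an instance-guided contrastive objective can look like in general. It assumes a supervised-contrastive (InfoNCE-style) formulation in which features sharing a 2D instance ID are pulled together and all others are pushed apart. The function name `instance_contrastive_loss`, the `temperature` parameter, and the input shapes are hypothetical and not taken from the paper.

```python
import numpy as np


def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """Supervised-contrastive-style loss (illustrative, not the paper's loss).

    features:     (N, D) array, e.g. per-pixel language features rendered
                  from the 3D Gaussians (assumed input).
    instance_ids: (N,) integer array of 2D instance labels for the same
                  pixels (assumed input).
    Assumes N >= 2, non-zero feature vectors, and at least one pair of
    samples sharing an instance id.
    """
    # L2-normalise so that dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = f @ f.T / temperature

    n = len(f)
    not_self = ~np.eye(n, dtype=bool)
    # Positives: other samples carrying the same instance id.
    positives = (instance_ids[:, None] == instance_ids[None, :]) & not_self

    # Log-softmax over all other samples (self excluded), computed stably.
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Mean log-probability of the positives, for anchors that have any.
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())
```

With this formulation, features that cluster by instance yield a lower loss than the same features paired with mismatched instance labels, which is the behaviour an instance-guided alignment term is meant to encourage.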
Qijian Tian
Shanghai Jiao Tong University
Xin Tan
East China Normal University, Shanghai Artificial Intelligence Laboratory
Jiayu Ying
East China Normal University
Xuhong Wang
Shanghai Artificial Intelligence Laboratory
LLM, Knowledge System, AI Simulation
Yuan Xie
East China Normal University
Lizhuang Ma
Shanghai Jiao Tong University