TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of 3D feature extraction and cross-modal alignment among text, images, and 3D data by proposing a spatially aware multimodal alignment framework. The method introduces a novel multi-branch tokenizer that decouples the 3D Gaussian Splatting (3DGS) representation into compact latent codes, effectively integrating multi-view features with diffusion priors to mitigate view ambiguity. Furthermore, a text-to-3D adaptive projection module is designed to enable fine-grained cross-modal alignment. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance across multiple benchmark datasets in tasks including cross-modal retrieval, zero-shot classification, and scene recognition.
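The multi-branch tokenizer described above can be illustrated with a minimal sketch: each intrinsic 3DGS property (position, rotation, scale, opacity, color) is encoded by its own small MLP branch, and the per-branch latents are concatenated into one compact token per Gaussian. All sizes, weights, and the two-layer MLP here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-Gaussian attributes: position (3), rotation quaternion (4),
# scale (3), opacity (1), and color (truncated to 3 for the sketch).
N = 128
gaussians = {
    "position": rng.normal(size=(N, 3)),
    "rotation": rng.normal(size=(N, 4)),
    "scale": rng.normal(size=(N, 3)),
    "opacity": rng.normal(size=(N, 1)),
    "color": rng.normal(size=(N, 3)),
}

LATENT = 16  # per-branch latent width (illustrative choice)

def branch_encoder(x, w1, w2):
    """Two-layer MLP with ReLU; one independent encoder per property branch."""
    return np.maximum(x @ w1, 0.0) @ w2

# One weight pair per branch: the properties are tokenized separately,
# i.e. "decoupled" rather than fed through a single shared encoder.
weights = {
    k: (rng.normal(size=(v.shape[1], 32)) * 0.1,
        rng.normal(size=(32, LATENT)) * 0.1)
    for k, v in gaussians.items()
}

# Encode each branch, then concatenate into one compact token per Gaussian.
tokens = np.concatenate(
    [branch_encoder(v, *weights[k]) for k, v in gaussians.items()], axis=1
)
print(tokens.shape)  # (128, 80): N Gaussians x (5 branches * 16 latent dims)
```

Decoupling the branches lets each encoder specialize on one property's statistics (e.g. unit quaternions vs. log-scales), which is the intuition behind the tokenizer's "more generalizable feature extraction" claim.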

📝 Abstract
While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between different modalities, we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop bidirectional cross-modal alignment strategies: a multi-view feature fusion mechanism that leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, and a text-3D projection module that adaptively maps 3D features into the text embedding space for better text-3D alignment. Extensive experiments on various datasets demonstrate the state-of-the-art performance of TIGaussian in multiple tasks.
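The text-3D projection module in the abstract amounts to learning a map from 3D tokens into the text embedding space, where alignment can be scored by cosine similarity. The sketch below shows that shape with a plain linear projection and random stand-in features; the dimensions, the linear map, and the retrieval-by-argmax step are assumptions for illustration, not the paper's module.

```python
import numpy as np

rng = np.random.default_rng(1)

D3, DT = 80, 64  # hypothetical 3D-token and text-embedding widths

# Pooled 3D feature per shape and one text embedding per caption
# (random stand-ins for real encoder outputs).
feat_3d = rng.normal(size=(4, D3))
feat_text = rng.normal(size=(4, DT))

# Adaptive projection: a learned linear map from 3D space into text space.
W = rng.normal(size=(D3, DT)) * 0.1
proj = feat_3d @ W

def l2norm(x):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine-similarity matrix used for contrastive text-3D alignment; at inference
# the same matrix drives cross-modal retrieval.
sim = l2norm(proj) @ l2norm(feat_text).T
retrieved = sim.argmax(axis=1)  # best-matching caption index per shape
print(sim.shape)  # (4, 4)
```

Training would push the diagonal of `sim` up and the off-diagonal down (a standard contrastive objective); here the matrix only demonstrates the interface between the two embedding spaces.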
Problem

Research questions and friction points this paper is trying to address.

3D modality
cross-modal alignment
text-image-3D
feature extraction
modality gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
cross-modal alignment
multi-branch tokenizer
text-image-3D fusion
diffusion priors
Jiarun Liu
Unmanned Vehicle Dept., Cainiao Inc., Alibaba Group, Hangzhou, China
Qifeng Chen
HKUST
Computational Photography, Image Synthesis, Generative AI, Autonomous Driving, Embodied AI
Yiru Zhao
Alibaba DAMO Academy
Computer Vision
Minghua Liu
Hillbot
3D Vision, Embodied AI
Baorui Ma
Tsinghua University
Sheng Yang
Unmanned Vehicle Dept., Cainiao Inc., Alibaba Group, Hangzhou, China