🤖 AI Summary
This work addresses the challenges of 3D feature extraction and cross-modal alignment among text, images, and 3D data by proposing TIGaussian, a spatially aware multimodal alignment framework. The method introduces a multi-branch tokenizer that decouples the 3D Gaussian Splatting (3DGS) representation into compact latent codes, and pairs it with a multi-view feature fusion mechanism that integrates diffusion priors to mitigate view ambiguity in image-3D alignment. Furthermore, a text-to-3D adaptive projection module enables fine-grained text-3D alignment. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance across multiple benchmark datasets on tasks including cross-modal retrieval, zero-shot classification, and scene recognition.
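To make the tokenizer idea concrete, below is a minimal sketch of what a multi-branch 3DGS tokenizer could look like: each per-Gaussian attribute group (position, scale, rotation, opacity, spherical-harmonic color coefficients) gets its own encoding branch, and the decoupled codes are fused into one compact latent token. The attribute grouping, dimensions, and all module names here are assumptions for illustration; the abstract does not specify the actual architecture.

```python
# Hypothetical sketch of a multi-branch tokenizer for 3DGS attributes.
# All dimensions and names are illustrative, not the paper's design.
import torch
import torch.nn as nn


class MultiBranchGSTokenizer(nn.Module):
    """Encodes each 3DGS attribute group with its own branch, then
    fuses the per-branch codes into one compact latent token."""

    # Per-Gaussian attribute groups and their raw dimensions
    # (xyz position, scale, rotation quaternion, opacity, SH color coeffs).
    ATTR_DIMS = {"xyz": 3, "scale": 3, "rotation": 4, "opacity": 1, "sh": 48}

    def __init__(self, code_dim: int = 64):
        super().__init__()
        # One small MLP branch per decoupled attribute group.
        self.branches = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, code_dim), nn.GELU(), nn.Linear(code_dim, code_dim)
            )
            for name, dim in self.ATTR_DIMS.items()
        })
        # Fuse the decoupled codes into a single latent token per Gaussian.
        self.fuse = nn.Linear(code_dim * len(self.ATTR_DIMS), code_dim)

    def forward(self, gaussians: dict) -> torch.Tensor:
        codes = [self.branches[name](gaussians[name]) for name in self.ATTR_DIMS]
        return self.fuse(torch.cat(codes, dim=-1))  # (N, code_dim)


# Toy usage: tokenize 1024 randomly initialized Gaussians.
gs = {name: torch.randn(1024, dim)
      for name, dim in MultiBranchGSTokenizer.ATTR_DIMS.items()}
tokens = MultiBranchGSTokenizer()(gs)
print(tokens.shape)  # torch.Size([1024, 64])
```

Decoupling the attributes before fusion is what would let each branch specialize (e.g., geometry vs. appearance), which is one plausible reading of how the compact latents stay generalizable.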
📝 Abstract
While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between different modalities, we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modal alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop bidirectional cross-modal alignment strategies: a multi-view feature fusion mechanism leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, while a text-3D projection module adaptively maps 3D features into the text embedding space for better text-3D alignment. Extensive experiments on various datasets demonstrate that TIGaussian achieves state-of-the-art performance across multiple tasks.
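For the text-3D side of the alignment, a common recipe is to project pooled 3D features into a frozen text encoder's embedding space and train with a symmetric contrastive objective. The sketch below illustrates that pattern with a gated ("adaptive") projection and a CLIP-style InfoNCE loss; the gating design, dimensions, and loss choice are assumptions, since the abstract only states that 3D features are adaptively mapped to the text embedding space.

```python
# Hypothetical sketch: adaptive text-3D projection + symmetric
# contrastive alignment. Names/dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextAdaptiveProjection(nn.Module):
    """Projects pooled 3D features into the text embedding space,
    modulating the projection with a learned per-sample gate."""

    def __init__(self, feat_dim: int = 64, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)
        self.gate = nn.Sequential(nn.Linear(feat_dim, text_dim), nn.Sigmoid())

    def forward(self, feats_3d: torch.Tensor) -> torch.Tensor:
        # Gate values in (0, 1) adaptively reweight each projected channel.
        return self.gate(feats_3d) * self.proj(feats_3d)


def clip_style_loss(z3d: torch.Tensor, ztext: torch.Tensor,
                    tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched 3D/text pairs in a batch."""
    z3d, ztext = F.normalize(z3d, dim=-1), F.normalize(ztext, dim=-1)
    logits = z3d @ ztext.t() / tau
    targets = torch.arange(z3d.size(0))  # i-th 3D matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage: 8 matched shape/text pairs.
proj = TextAdaptiveProjection()
z3d = proj(torch.randn(8, 64))      # pooled 3D tokens -> text space
ztext = torch.randn(8, 512)         # stand-in for text-encoder embeddings
print(clip_style_loss(z3d, ztext))  # scalar alignment loss
```

The symmetric loss pulls each 3D latent toward its paired caption embedding and pushes it away from the others in the batch, which is the standard mechanism behind the cross-modal retrieval and zero-shot classification results the abstract reports.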