GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
To address the dense manual annotation requirements and computationally expensive voxel-based representations of existing 3D semantic occupancy prediction methods, this paper proposes a Gaussian-representation-driven Transformer framework. It predicts sparse sets of Gaussians in a feed-forward manner to jointly reconstruct scene geometry and semantics, and integrates Gaussian splatting rendering with vision foundation models (VFMs) for cross-modal feature alignment, establishing the first self-supervised paradigm bridging Gaussian representations and VFMs. The method eliminates reliance on voxel discretization and human-annotated labels, enabling open-vocabulary semantic occupancy prediction. Evaluated on Occ3D-nuScenes, it achieves a zero-shot mIoU of 12.27 and reduces training time by 40%, significantly improving both generalization and computational efficiency.
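The core self-supervision signal described above is a feature-alignment objective: Gaussians are splatted into 2D views, and the rendered per-pixel features are pulled toward the VFM's features for the same view. A minimal sketch of such a loss is below; the function name `feature_alignment_loss` and the cosine-distance form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def feature_alignment_loss(rendered, target, eps=1e-8):
    """Illustrative cosine-distance loss between rendered per-pixel
    features and VFM features of the same camera view (an assumption;
    the paper's exact objective may differ).

    rendered, target: (H, W, C) feature maps.
    """
    # L2-normalize each pixel's feature vector
    r = rendered / (np.linalg.norm(rendered, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = np.sum(r * t, axis=-1)       # per-pixel cosine similarity
    return float(np.mean(1.0 - cos))   # 0 when features align perfectly

# usage: identical feature maps yield zero loss
f = np.random.rand(4, 4, 8)
loss = feature_alignment_loss(f, f)
```

Because the target features come from a frozen foundation model rather than human labels, minimizing this loss requires no annotations.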

📝 Abstract
3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face challenges in scalability and generalization due to their reliance on extensive labeled data and computationally intensive voxel-wise representations. In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. GaussTR predicts sparse sets of Gaussians in a feed-forward manner to represent 3D scenes. By splatting the Gaussians into 2D views and aligning the rendered features with foundation models, GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic occupancy prediction without requiring explicit annotations. Empirical experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents. The code is available at https://github.com/hustvl/GaussTR.
Problem

Research questions and friction points this paper is trying to address.

Improving 3D semantic occupancy prediction scalability and generalization
Reducing reliance on labeled data with self-supervised learning
Enabling open-vocabulary semantic occupancy without explicit annotations
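The open-vocabulary aspect in the bullets above typically works by matching learned features against text embeddings of arbitrary category names. A hedged sketch of that matching step, assuming each Gaussian carries a feature in the VFM's embedding space (the name `open_vocab_labels` is illustrative):

```python
import numpy as np

def open_vocab_labels(gauss_feats, text_embeds):
    """Assign each Gaussian the category whose text embedding is most
    similar (cosine), so new vocabularies need only new text prompts,
    not new annotations. Shapes: (N, C) features, (K, C) embeddings.
    A sketch of the general open-vocabulary recipe, not the paper's code."""
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return np.argmax(g @ t.T, axis=1)  # (N,) class index per Gaussian
```

Swapping in a different set of text embeddings changes the label space at inference time without retraining.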
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian-based Transformer for 3D modeling
Self-supervised learning via foundation model alignment
Sparse Gaussian representations reduce training time
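The efficiency claim in the bullets above comes from representing a scene with a small set of Gaussian primitives instead of a dense voxel grid. A minimal sketch of one such primitive and the resulting primitive-count ratio, assuming the standard 200x200x16 Occ3D-nuScenes grid (the `SceneGaussian` fields are an illustrative parameterization, not the paper's exact one):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneGaussian:
    """One Gaussian in a sparse scene set predicted feed-forward
    (illustrative fields; the paper's parameterization may differ)."""
    mean: np.ndarray      # (3,) center in world space
    scale: np.ndarray     # (3,) per-axis extent
    rotation: np.ndarray  # (4,) unit quaternion
    opacity: float        # in [0, 1]
    feature: np.ndarray   # (C,) feature aligned with the VFM space

def primitive_ratio(n_gaussians, grid=(200, 200, 16)):
    """Primitive count relative to a dense voxel grid of the given shape."""
    return n_gaussians / float(np.prod(grid))
```

For example, a few thousand Gaussians amount to well under a percent of the 640,000 cells in a dense 200x200x16 grid, which is the source of the training-time savings.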