GaussianFormer3D: Multi-Modal Gaussian-based Semantic Occupancy Prediction with 3D Deformable Attention

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory overhead and low computational efficiency of voxel-based representations in 3D semantic occupancy prediction, this paper proposes the first LiDAR-camera multi-modal fusion framework built upon 3D Gaussians. Methodologically, it (1) replaces dense voxels with compact, continuous, object-centric Gaussian primitives; (2) introduces a voxel-to-Gaussian initialization strategy that gives the Gaussians geometry priors from LiDAR data; and (3) designs a LiDAR-guided 3D deformable attention mechanism that refines the Gaussians with fused LiDAR-camera features. Evaluated on both on-road and off-road datasets, the approach achieves accuracy comparable to state-of-the-art multi-modal fusion methods while reducing memory consumption by 37% and accelerating inference by 2.1× compared to prior methods. These improvements enhance the efficiency and practicality of 3D semantic modeling for autonomous driving applications.
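
To make the idea of a continuous Gaussian representation concrete, the following is a minimal sketch of how a set of Gaussian primitives could be queried for semantics at arbitrary 3D points. It is not the paper's implementation: the function name `gaussians_to_semantics`, the axis-aligned (rotation-free) covariance, and the simple additive aggregation are all assumptions made for illustration.

```python
import numpy as np

def gaussians_to_semantics(query, means, scales, opacity, class_probs):
    """Hypothetical sketch: accumulate per-class scores at query points.

    query:       (Q, 3) points where occupancy semantics are evaluated
    means:       (G, 3) Gaussian centers
    scales:      (G, 3) per-axis standard deviations (assumed axis-aligned)
    opacity:     (G,)   per-Gaussian weight in [0, 1]
    class_probs: (G, C) per-Gaussian class distribution
    """
    # Offset of every query point from every Gaussian center: (Q, G, 3).
    diff = query[:, None, :] - means[None, :, :]
    # Squared Mahalanobis distance under a diagonal covariance: (Q, G).
    maha = np.sum((diff / scales[None, :, :]) ** 2, axis=-1)
    # Each Gaussian's contribution decays with distance from its center.
    weight = opacity[None, :] * np.exp(-0.5 * maha)
    # Sum the weighted class distributions over all Gaussians: (Q, C).
    return weight @ class_probs

# Two well-separated Gaussians, each carrying a different class.
means = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
scales = np.ones((2, 3))
opacity = np.ones(2)
class_probs = np.eye(2)
scores = gaussians_to_semantics(means.copy(), means, scales, opacity, class_probs)
labels = scores.argmax(axis=1)
```

In this sketch the scene is described by a few parameters per Gaussian rather than one cell per voxel, and a dense grid is only produced when query points are sampled from it.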

📝 Abstract
3D semantic occupancy prediction is critical for achieving safe and reliable autonomous driving. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and detailed predictions. Although most existing works utilize a dense grid-based representation, in which the entire 3D space is uniformly divided into discrete voxels, the emergence of 3D Gaussians provides a compact and continuous object-centric representation. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, named GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy to provide 3D Gaussians with geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism for refining 3D Gaussians with LiDAR-camera fusion features in a lifted 3D space. We conducted extensive experiments on both on-road and off-road datasets, demonstrating that GaussianFormer3D achieves prediction accuracy comparable to state-of-the-art multi-modal fusion-based methods, with reduced memory consumption and improved efficiency.
Problem

Research questions and friction points this paper is trying to address.

Improving 3D semantic occupancy prediction for autonomous driving
Enhancing multi-modal LiDAR-camera fusion accuracy and detail
Reducing memory usage while maintaining high prediction accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Gaussian-based semantic occupancy prediction
Voxel-to-Gaussian initialization with LiDAR priors
LiDAR-guided 3D deformable attention for fusion
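
As a rough illustration of the voxel-to-Gaussian initialization idea listed above, the sketch below seeds Gaussian centers from occupied LiDAR voxels. The function name `voxel_to_gaussian_init`, the parameters `voxel_size` and `num_gaussians`, the "keep the densest voxels" selection rule, and the half-voxel initial scale are all hypothetical choices; the source only states that LiDAR data provides geometry priors for the Gaussians.

```python
import numpy as np

def voxel_to_gaussian_init(points, voxel_size=0.5, num_gaussians=4):
    """Hypothetical sketch: seed Gaussian means from non-empty LiDAR voxels."""
    # Assign each LiDAR point (N, 3) to an integer voxel index.
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel; `inverse` maps each point to its voxel row.
    uniq, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    # Average the points that fall inside each occupied voxel.
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)
    centroids = sums / counts[:, None]
    # Assumed selection rule: keep the most densely populated voxels.
    order = np.argsort(-counts, kind="stable")[:num_gaussians]
    means = centroids[order]
    # Assumed initial extent: half a voxel along each axis.
    scales = np.full((len(means), 3), voxel_size / 2.0)
    return means, scales

# Three points share one voxel; a fourth point sits in a distant voxel.
points = np.array([
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [0.3, 0.3, 0.3],
    [2.1, 2.1, 2.1],
])
means, scales = voxel_to_gaussian_init(points, voxel_size=0.5, num_gaussians=1)
```

Starting the Gaussians where LiDAR returns actually exist, rather than at random or uniformly spaced positions, is one plausible way to give them a geometry prior before any attention-based refinement.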
Lingjun Zhao
Georgia Institute of Technology, Atlanta, GA USA
Sizhe Wei
Georgia Institute of Technology
Robotics
James Hays
Georgia Tech
Computer Vision, Robotics, Machine Learning, AI
Lu Gan
Georgia Institute of Technology, Atlanta, GA USA