🤖 AI Summary
This work addresses the high computational cost of existing multi-modal 3D semantic occupancy prediction methods in autonomous driving, which typically rely on dense voxel or bird’s-eye-view (BEV) representations. The authors instead model the scene with a compact set of semantic 3D Gaussians. A LiDAR Completion Diffuser densifies sparse LiDAR point clouds to initialize robust Gaussian anchors, and a Gaussian Anchor Fusion module performs geometry-aligned cross-modal fusion of these anchors with multi-view image semantics. By avoiding dense voxelization and instead coupling 3D Gaussian representations with 2D image features, the method achieves state-of-the-art performance on multiple challenging benchmarks while substantially reducing computational cost.
📝 Abstract
3D semantic occupancy prediction is crucial for autonomous driving. While multi-modal fusion improves accuracy over vision-only methods, it typically relies on computationally expensive dense voxel or BEV tensors. We present Gau-Occ, a multi-modal framework that bypasses dense volumetric processing by modeling the scene as a compact collection of semantic 3D Gaussians. To ensure geometric completeness, we propose a LiDAR Completion Diffuser (LCD) that recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors. Furthermore, we introduce Gaussian Anchor Fusion (GAF), which efficiently integrates multi-view image semantics via geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, Gau-Occ captures both spatial consistency and semantic discriminability. Extensive experiments across challenging benchmarks demonstrate that Gau-Occ achieves state-of-the-art performance with significantly improved computational efficiency.
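The abstract does not detail GAF's geometry-aligned 2D sampling, but the general idea is standard: project each Gaussian anchor center into a camera view with a pinhole model and bilinearly sample the corresponding 2D feature map. The sketch below illustrates only that generic step; the function names, the intrinsics `K`, and the extrinsics `T_cam_from_world` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def project_points(points_3d, K, T_cam_from_world):
    """Project 3D anchor centers (N, 3) to pixel coords via a pinhole camera.

    Returns (uv, valid): (N, 2) pixel coordinates and a mask of points
    that lie in front of the camera.
    """
    n = points_3d.shape[0]
    pts_h = np.hstack([points_3d, np.ones((n, 1))])      # homogeneous (N, 4)
    pts_cam = (T_cam_from_world @ pts_h.T).T[:, :3]      # world -> camera frame
    valid = pts_cam[:, 2] > 1e-6                         # positive depth only
    uv_h = (K @ pts_cam.T).T                             # apply intrinsics
    uv = uv_h[:, :2] / np.clip(uv_h[:, 2:3], 1e-6, None) # perspective divide
    return uv, valid

def bilinear_sample(feat_map, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) pixel coords."""
    h, w, _ = feat_map.shape
    u = np.clip(uv[:, 0], 0.0, w - 1.001)
    v = np.clip(uv[:, 1], 0.0, h - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    f00, f01 = feat_map[v0, u0], feat_map[v0, u0 + 1]
    f10, f11 = feat_map[v0 + 1, u0], feat_map[v0 + 1, u0 + 1]
    return (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
            + f10 * (1 - du) * dv + f11 * du * dv)
```

In a multi-camera rig, this pair of steps would run once per view, and the per-view samples would then be fused into each anchor's descriptor (the cross-modal alignment that GAF adds on top is not shown here).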