🤖 AI Summary
Existing 3D occupancy prediction methods for autonomous driving suffer from high computational overhead and struggle to model the temporal dynamics of driving scenes. To address these issues, this paper proposes S2GO, a streaming sparse Gaussian representation framework. Instead of dense voxels or fixed Gaussian distributions, the method summarizes the scene into a compact set of sparse 3D queries that are propagated through time, yielding a lightweight, online, and temporally consistent semantic-geometric representation. A denoising rendering loss jointly optimizes query positions and Gaussian parameters, and a streaming temporal propagation mechanism enables efficient modeling of dynamic scenes. On the nuScenes and KITTI occupancy benchmarks, S2GO achieves state-of-the-art performance, outperforming prior methods such as GaussianWorld by 1.5 IoU with 5.9× faster inference.
📝 Abstract
Despite the demonstrated efficiency and performance of sparse query-based representations for perception, state-of-the-art 3D occupancy prediction methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the scene into a compact set of 3D queries which are propagated through time in an online, streaming fashion. These queries are then decoded into semantic Gaussians at each timestep. We couple our framework with a denoising rendering objective to guide the queries and their constituent Gaussians in effectively capturing scene geometry. Owing to its efficient, query-based representation, S2GO achieves state-of-the-art performance on the nuScenes and KITTI occupancy benchmarks, outperforming prior art (e.g., GaussianWorld) by 1.5 IoU with 5.9x faster inference.
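The core loop described above, sparse queries decoded into semantic Gaussians at each timestep and then propagated to the next frame, can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the decoder weights are random rather than learned, the feature dimensions and the rigid ego-motion propagation are assumptions, and the learned refinement and denoising rendering objective are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N_QUERIES, FEAT_DIM, N_CLASSES = 64, 32, 8

# Hypothetical decoder weights (learned in the actual method; random here).
W_mean = rng.normal(size=(FEAT_DIM, 3)) * 0.1
W_scale = rng.normal(size=(FEAT_DIM, 3)) * 0.1
W_sem = rng.normal(size=(FEAT_DIM, N_CLASSES)) * 0.1

def decode_gaussians(queries):
    """Decode each sparse 3D query into one semantic Gaussian."""
    feats, pos = queries["feat"], queries["pos"]
    return {
        "mean": pos + feats @ W_mean,    # Gaussian center = query position + offset
        "scale": np.exp(feats @ W_scale),  # exp keeps scales positive
        "sem": feats @ W_sem,            # per-Gaussian semantic logits
    }

def propagate(queries, ego_motion):
    """Stream queries to the next frame by moving positions into the new ego frame."""
    R, t = ego_motion
    return {"feat": queries["feat"], "pos": queries["pos"] @ R.T + t}

# Two streaming steps with a toy forward ego-motion.
queries = {"feat": rng.normal(size=(N_QUERIES, FEAT_DIM)),
           "pos": rng.uniform(-10, 10, size=(N_QUERIES, 3))}
R, t = np.eye(3), np.array([0.5, 0.0, 0.0])
for _ in range(2):
    gaussians = decode_gaussians(queries)
    queries = propagate(queries, (R, t))
```

In the full method the propagated queries would also be refined against the current frame's image features before decoding; here propagation is a pure coordinate transform to keep the sketch self-contained.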