🤖 AI Summary
Discriminative methods for vision-based 3D occupancy prediction in autonomous driving exhibit limitations in robustness to sensor noise, consistency in occluded regions, and preservation of 3D structural integrity. Method: This paper introduces diffusion models as the first generative approach to 3D occupancy modeling, explicitly learning the underlying 3D scene distribution and geometric priors to improve robustness against incomplete observations and sensor noise. The architecture employs a 3D convolutional encoder-decoder that fuses multi-view image features and produces voxel-wise occupancy probabilities through iterative denoising. Results: On benchmarks including nuScenes, the method significantly outperforms state-of-the-art discriminative approaches, particularly in occluded and low-visibility regions. Moreover, it improves downstream motion-planning success rate by 12.7%, demonstrating superior generalization and geometric fidelity in real-world driving scenarios.
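The iterative denoising described above can be sketched as a DDPM-style reverse loop over a voxel grid. Everything here is an illustrative assumption rather than the paper's implementation: the linear noise schedule, timestep count, grid size, and the toy `denoise_fn` standing in for the learned, image-conditioned 3D convolutional encoder-decoder.

```python
# Minimal sketch (NOT the paper's code): DDPM-style reverse diffusion that
# iteratively denoises Gaussian noise into per-voxel occupancy probabilities.
# Schedule, shapes, and the stand-in denoiser are all illustrative assumptions.
import numpy as np

def ddpm_reverse(denoise_fn, shape=(8, 8, 8), T=50, seed=0):
    """Start from x_T ~ N(0, I) and iteratively denoise to a voxel logit grid."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)              # initial noise volume
    for t in reversed(range(T)):
        eps_hat = denoise_fn(x, t)              # predicted noise at step t
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                               # inject noise except at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    # Squash final logits into voxel-wise occupancy probabilities.
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in for the learned 3D conv denoiser conditioned on image features:
occ = ddpm_reverse(lambda x, t: 0.5 * x, seed=1)
```

In the actual method, `denoise_fn` would be the 3D convolutional encoder-decoder conditioned on fused multi-view image features; the loop structure is otherwise the standard generative recipe the summary describes.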
📝 Abstract
Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency and noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.