🤖 AI Summary
To address key challenges in vision-driven joint prediction of 3D semantic occupancy and motion flow (strong reliance on depth priors, 2D–3D semantic misalignment, large scale discrepancies in dynamic flow estimation, and poor robustness to long-tail categories), this paper proposes ALOcc, a purely convolutional framework. First, an adaptive occlusion-aware feature enhancement and depth denoising module mitigates sensitivity to depth errors. Second, shared semantic prototypes explicitly align the 2D and 3D feature spaces. Third, a BEV cost-volume hierarchical prediction architecture integrates classification-regression collaborative supervision with confidence- and class-weighted sampling. Evaluated on the Occ3D benchmark, the method achieves a +2.5% absolute RayIoU improvement, runs at 25 FPS (input resolution 256×704, ResNet-50 backbone), and ranked second in the CVPR 2024 Occupancy and Flow Challenge.
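The classification-regression collaborative supervision mentioned above can be illustrated with a minimal sketch: each flow value is discretized into a magnitude bin (classification target) plus a within-bin residual (regression target), so small and large flows are supervised on comparable scales. The bin edges and function name below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cls_reg_flow_target(flow, bin_edges):
    """Split a scalar flow magnitude into a bin index (classification
    target) and a within-bin residual (regression target).

    flow: scalar flow magnitude.
    bin_edges: sorted 1D array of bin boundaries (illustrative choice).
    """
    # Find the bin containing `flow`; clamp out-of-range values to the
    # first/last bin so every target is well defined.
    idx = np.clip(np.searchsorted(bin_edges, flow, side="right") - 1,
                  0, len(bin_edges) - 2)
    residual = flow - bin_edges[idx]
    return int(idx), float(residual)

# Example: with edges [0, 1, 2, 4, 8], a flow of 2.5 falls in bin 2
# and leaves a residual of 0.5 to be regressed.
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
idx, res = cls_reg_flow_target(2.5, edges)
```

A cross-entropy loss on the bin index and an L1 loss on the residual would then be combined, which keeps gradients balanced across slow and fast-moving objects.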
📝 Abstract
Vision-based semantic occupancy and flow prediction provides crucial spatiotemporal cues for real-world tasks such as autonomous driving. Existing methods prioritize higher accuracy to satisfy the demands of these tasks. In this work, we improve performance through a series of targeted refinements for 3D semantic occupancy prediction and flow estimation. First, we introduce an occlusion-aware adaptive lifting mechanism with a depth denoising technique to improve the robustness of 2D-to-3D feature transformation and reduce the reliance on depth priors. Second, we strengthen the semantic consistency between 3D features and their original 2D modalities by utilizing shared semantic prototypes to jointly constrain both 2D and 3D features. This is complemented by confidence- and category-based sampling strategies that tackle long-tail challenges in 3D space. To ease the feature encoding burden of jointly predicting semantics and flow, we propose a BEV cost volume-based prediction method that links flow and semantic features through a cost volume and employs a classification-regression supervision scheme to handle the varying flow scales in dynamic scenes. Our purely convolutional framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy, attaining state-of-the-art results on multiple benchmarks. On Occ3D, when trained without the camera visible mask, ALOcc achieves an absolute gain of 2.5% in RayIoU while operating at a speed comparable to the state of the art, using the same input size (256$\times$704) and ResNet-50 backbone. Our method also took 2nd place in the CVPR 2024 Occupancy and Flow Prediction Competition.
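The BEV cost volume that links flow and semantic features can be sketched as a local correlation between current and previous BEV feature maps over a displacement window. This is a generic cost-volume construction under assumed shapes and naming, not the paper's exact formulation.

```python
import numpy as np

def bev_cost_volume(feat_t, feat_prev, max_disp=2):
    """Correlate current BEV features with spatially shifted previous-frame
    features over a (2*max_disp+1)^2 displacement window.

    feat_t, feat_prev: (C, H, W) BEV feature maps (illustrative shapes).
    Returns: (D, H, W) cost volume with D = (2*max_disp+1)**2 candidate
    displacements; the peak channel at each BEV cell indicates the most
    likely motion offset.
    """
    C, H, W = feat_t.shape
    side = 2 * max_disp + 1
    vol = np.zeros((side * side, H, W), dtype=feat_t.dtype)
    # Zero-pad the previous frame so every shift stays in bounds.
    pad = np.pad(feat_prev,
                 ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    for i, dy in enumerate(range(-max_disp, max_disp + 1)):
        for j, dx in enumerate(range(-max_disp, max_disp + 1)):
            shifted = pad[:, max_disp + dy: max_disp + dy + H,
                             max_disp + dx: max_disp + dx + W]
            # Correlation score: mean over channels of the elementwise product.
            vol[i * side + j] = (feat_t * shifted).mean(axis=0)
    return vol
```

In a full model, this volume would be consumed by convolutional heads for both flow regression and semantic refinement, which is what lets the two tasks share one encoding.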