VPOcc: Exploiting Vanishing Point for Monocular 3D Semantic Occupancy Prediction

📅 2024-08-07
🏛️ arXiv.org
🤖 AI Summary
In monocular RGB-based 3D semantic occupancy prediction, perspective projection induces a 2D–3D scale inconsistency and a depth-wise feature imbalance: near regions occupy many pixels while distant regions occupy few. To address this, the authors propose an end-to-end framework that incorporates vanishing point (VP) geometric priors. The method introduces three components: (1) VPZoomer, a VP-guided image warping module that generates a zoom-in view to counter perspective distortion; (2) VPCA, a VP-guided cross-attention mechanism that aggregates features by sampling points toward the VP; and (3) BVFV, a balanced feature volume fusion module that combines the original and zoom-in voxel feature volumes. VP priors are thus integrated across feature extraction, cross-modal interaction, and voxel fusion. Evaluated on SemanticKITTI and SSCBench-KITTI360, the approach achieves state-of-the-art IoU and mIoU, mitigating the near–far feature imbalance.
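The paper's VPZoomer generates the zoom-in image with a VP-based homography warp. As a simpler illustration of the underlying idea (not the authors' implementation), the sketch below contracts sampling coordinates toward the VP, so that distant regions, which project near the VP, are magnified in the output. The function name, the fixed zoom factor, and the nearest-neighbour resampling are all illustrative assumptions.

```python
import numpy as np

def vp_zoom_warp(image: np.ndarray, vp, zoom: float = 2.0) -> np.ndarray:
    """Resample `image` so the region around the vanishing point `vp`
    (x, y) is magnified by `zoom`, imitating a VP-centred zoom-in.

    Output pixel p samples the input at vp + (p - vp) / zoom, i.e.
    sampling coordinates are contracted toward the VP."""
    h, w = image.shape[:2]
    vx, vy = vp
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Pull each output coordinate toward the VP, then snap to the
    # nearest valid source pixel (nearest-neighbour interpolation).
    src_x = np.clip(np.round(vx + (xs - vx) / zoom), 0, w - 1).astype(np.int64)
    src_y = np.clip(np.round(vy + (ys - vy) / zoom), 0, h - 1).astype(np.int64)
    return image[src_y, src_x]
```

With `zoom = 1.0` the warp is the identity; larger factors devote more output pixels to the area around the VP, which is where far-range content concentrates under perspective projection.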

📝 Abstract
Monocular 3D semantic occupancy prediction is becoming important in robot vision due to the compactness of using a single RGB camera. However, existing methods often do not adequately account for camera perspective geometry, resulting in an information imbalance along the depth range of the image. To address this issue, we propose a vanishing point (VP) guided monocular 3D semantic occupancy prediction framework named VPOcc. Our framework consists of three novel modules utilizing the VP. First, the VPZoomer module applies the VP during feature extraction, generating a zoom-in image based on the VP to extract information-balanced features across the scene. Second, we perform perspective geometry-aware feature aggregation by sampling points toward the VP with a VP-guided cross-attention (VPCA) module. Finally, we create an information-balanced feature volume by effectively fusing the original and zoom-in voxel feature volumes with a balanced feature volume fusion (BVFV) module. Experiments demonstrate that our method achieves state-of-the-art IoU and mIoU on both SemanticKITTI and SSCBench-KITTI360 by effectively addressing the information imbalance in images through the VP. Our code will be available at www.github.com/anonymous.
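The abstract's "sampling points towards VP" can be pictured as choosing reference points along the line from a pixel to the vanishing point, the image direction along which scene depth increases under perspective projection. The sketch below is a minimal, hypothetical version of that sampling step; the function name, the number of points, and the choice to stop short of the VP are assumptions, not the paper's exact scheme.

```python
import numpy as np

def vp_guided_samples(pixel: np.ndarray, vp: np.ndarray,
                      num_points: int = 4) -> np.ndarray:
    """Sample 2D reference points on the segment from `pixel` toward the
    vanishing point `vp` (both shape (2,) arrays of x, y coordinates).

    t = 0 is the pixel itself; larger t moves toward (but not onto) the
    VP, so the points follow the perspective depth direction.
    Returns an array of shape (num_points, 2)."""
    t = np.linspace(0.0, 0.75, num_points)[:, None]  # stop short of the VP
    return (1.0 - t) * pixel[None, :] + t * vp[None, :]
```

In a cross-attention setting, such points could serve as the sampling locations at which image features are gathered for a given query, giving the aggregation an explicit perspective-geometry bias.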
Problem

Research questions and friction points this paper is trying to address.

Addresses the 2D–3D scale discrepancy in monocular semantic occupancy prediction
Leverages the vanishing point to explicitly model perspective geometry
Enhances 3D scene understanding for autonomous navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

VPZoomer module warps images using VP-based homography
VP-guided cross-attention enables perspective-aware feature aggregation
Balanced feature volume fusion integrates original and zoom-in voxel features
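The fusion step can be sketched as a voxel-wise weighted blend of the two feature volumes. This is a hedged illustration, not the paper's BVFV module: the function name is hypothetical, and how the per-voxel weights are produced (the paper presumably learns them) is left outside the sketch.

```python
import numpy as np

def fuse_volumes(orig: np.ndarray, zoom: np.ndarray,
                 weight: np.ndarray) -> np.ndarray:
    """Voxel-wise weighted fusion of the original and zoom-in feature
    volumes.

    orig, zoom: feature volumes of shape (X, Y, Z, C).
    weight:     per-voxel blend weight in [0, 1], shape (X, Y, Z, 1),
                e.g. predicted by a small network (assumption).
    Returns the fused volume, same shape as the inputs."""
    return weight * zoom + (1.0 - weight) * orig
```

A weight near 1 lets the zoom-in volume dominate (useful for far-range voxels, which the zoom-in branch covers more densely), while a weight near 0 keeps the original features.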