AI Summary
In real-time camera-based 3D occupancy prediction, low-resolution voxel queries severely degrade geometric and semantic fidelity. To address this, we propose a prototype-aware view transformation framework. Our method explicitly models high-order visual structures via a novel clustering-based image-segment prototype mechanism for view transformation. Additionally, we design a multi-view feature disentanglement module and a lightweight voxel decoder, enabling high-fidelity 3D occupancy reconstruction even at extremely low resolution, with only 25% of the original voxel count. Evaluated on Occ3D and SemanticKITTI, our approach achieves performance competitive with full-resolution baselines despite this 75% reduction, and it significantly outperforms existing real-time methods. The key contributions include: (1) the first prototype-driven view transformation leveraging clustered image-segment representations; and (2) an efficient, resolution-robust architecture that preserves structural and semantic integrity under severe voxel downsampling.
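To make the resolution figures concrete, the short Python sketch below checks that keeping 25% of the voxels corresponds to a 75% reduction. The grid shapes are hypothetical and chosen only for illustration; the summary does not state which axes are downsampled or by how much.

```python
from math import prod

# Hypothetical grid shapes (X, Y, Z); not taken from the paper.
full_shape = (200, 200, 16)      # assumed full-resolution voxel query grid
reduced_shape = (100, 100, 16)   # assumed: X and Y halved, Z unchanged

full_count = prod(full_shape)
reduced_count = prod(reduced_shape)

kept_fraction = reduced_count / full_count   # share of voxel queries that remain
reduction = 1.0 - kept_fraction              # share of voxel queries removed

print(f"kept {kept_fraction:.0%} of voxels, a {reduction:.0%} reduction")
```

Halving two of the three axes is one way to reach a quarter of the original count; other combinations of per-axis factors give the same ratio.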
Abstract
The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity of real-time deployment require smaller query resolutions, which inevitably lead to information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and compensates for the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both the Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.
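The idea of prototypes of clustered image segments can be illustrated with a minimal sketch. It is not the ProtoOcc implementation: the abstract does not specify the clustering algorithm or how prototypes are mapped onto voxel queries, so plain k-means stands in for the clustering step, and the function names (`cluster_prototypes`, `prototype_feature_map`) and the toy features are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_prototypes(
    features: List[Vector], k: int, iters: int = 10
) -> Tuple[List[int], List[Vector]]:
    """Plain k-means (an assumed stand-in) over flattened per-pixel features.

    Returns one segment id per pixel and one prototype (mean feature) per segment.
    """
    # Deterministic initialisation: k evenly spaced pixels act as first prototypes.
    step = len(features) // k
    prototypes = [list(features[i * step]) for i in range(k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each pixel joins the segment with the nearest prototype.
        labels = [
            min(range(k), key=lambda c: squared_distance(f, prototypes[c]))
            for f in features
        ]
        # Update step: each prototype becomes the mean feature of its segment.
        for c in range(k):
            members = [f for f, label in zip(features, labels) if label == c]
            if members:  # keep the old prototype if a segment is empty
                prototypes[c] = mean_vector(members)
    return labels, prototypes


def prototype_feature_map(labels: List[int], prototypes: List[Vector]) -> List[Vector]:
    """Replace each pixel's feature with the prototype of its segment."""
    return [prototypes[label] for label in labels]


# Toy 2x4 feature map, flattened row by row: left half and right half differ.
features = [
    [0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0],
    [1.0, 0.0], [1.0, 1.0], [11.0, 10.0], [11.0, 11.0],
]
labels, prototypes = cluster_prototypes(features, k=2)
proto_map = prototype_feature_map(labels, prototypes)
print(labels)
print(prototypes)
```

In this sketch a voxel query that projects onto a pixel would read the segment-level prototype at that pixel rather than only the raw per-pixel feature, which is one way a coarse query grid could still pick up region-level structure.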