ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing purely vision-based occupancy estimation methods rely on 2D projection or rendering supervision, leading to geometric inconsistency and severe depth information loss. To address this, we propose ShelfOcc—the first LiDAR-free, end-to-end 3D-occupancy estimation method with native 3D supervision. Leveraging video sequences, ShelfOcc models inter-frame static geometric consistency to automatically generate metrically accurate and semantically complete voxel ground-truth labels—without manual 3D annotation or auxiliary sensors. Our approach integrates static geometry filtering, dynamic content disentanglement, and semantic propagation to produce robust, generalizable 3D occupancy representations compatible with mainstream architectures. On the Occ3D-nuScenes benchmark, ShelfOcc significantly outperforms all weakly and self-supervised methods, achieving a +34% relative improvement in mIoU. It establishes, for the first time, a high-accuracy, scalable 3D supervision paradigm for LiDAR-free 3D scene understanding.
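The label-generation step described above (accumulating static geometry and propagating semantics into voxels) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes static points have already been filtered and transformed into a common world frame, and it assigns each occupied voxel the majority semantic label of the points falling into it. The function name and voxel size are hypothetical.

```python
import numpy as np
from collections import Counter, defaultdict

def voxelize_with_semantics(points, labels, voxel_size=0.4):
    """Sketch of semantic voxel-label generation (hypothetical helper).

    points: (N, 3) static 3D points in a shared world frame.
    labels: (N,)  per-point semantic class ids.
    Returns a dict mapping voxel index -> majority semantic label.
    """
    votes = defaultdict(Counter)
    indices = np.floor(points / voxel_size).astype(np.int64)
    for idx, lab in zip(map(tuple, indices), labels):
        votes[idx][int(lab)] += 1
    # Majority vote per voxel stabilizes labels against per-frame noise.
    return {idx: counter.most_common(1)[0][0] for idx, counter in votes.items()}

# Toy example: two nearby points share one voxel, a third lands elsewhere.
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.3, 0.1], [1.0, 0.0, 0.0]])
labs = np.array([2, 2, 5])
grid = voxelize_with_semantics(pts, labs)
```

Accumulating points from many frames before voting is what lets sparse per-frame geometry densify into a complete voxel ground truth.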

📝 Abstract
Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, their outputs do not work out of the box as supervision, owing to sparse, noisy, and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content, and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
Problem

Research questions and friction points this paper is trying to address.

Overcoming geometric inconsistencies in vision-based 3D occupancy estimation
Generating consistent 3D voxel labels without LiDAR or manual annotations
Mitigating sparse, noisy geometry in dynamic driving scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates 3D semantic voxel labels from video
Filters and accumulates static geometry across frames
Enables LiDAR-free supervision for occupancy estimation
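
The static-geometry filtering above can be illustrated with a simple consistency test. This is a hedged sketch under an assumption not stated in the source: that a point tracked across frames is kept as static geometry only if its ego-motion-compensated world-frame position stays nearly constant, while large drift indicates a moving object. The function name and tolerance are hypothetical.

```python
import numpy as np

def is_static(track_points_world, tol=0.2):
    """Sketch of a cross-frame static/dynamic test (hypothetical helper).

    track_points_world: (T, 3) world-frame positions of one tracked point,
    already compensated for ego motion. A static scene point should map to
    (almost) the same world coordinate in every frame.
    """
    deviations = np.linalg.norm(
        track_points_world - track_points_world.mean(axis=0), axis=1
    )
    # Small spread -> consistent static geometry; large spread -> dynamic.
    return bool(deviations.max() <= tol)

# A building corner stays put; a car moving along x drifts frame to frame.
static_track = np.array([[5.0, 1.0, 0.0], [5.05, 1.02, 0.0], [4.98, 0.99, 0.01]])
moving_track = np.array([[5.0, 1.0, 0.0], [6.0, 1.0, 0.0], [7.0, 1.0, 0.0]])
```

Points passing such a test can be safely accumulated across frames, while rejected points are handled separately as dynamic content.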