EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models

📅 2025-09-30
🤖 AI Summary
Self-supervised semantic occupancy prediction often relies on expensive rendering-based losses (e.g., novel view synthesis), which incur high computational and memory overhead during training. To reduce this cost, the paper proposes 3D pseudo-ground-truth labels generated with the foundation models Grounded-SAM and Metric3Dv2 and densified using temporal information. Integrated into the existing OccNeRF model, the labels raise mIoU from 9.73 to 14.09, a 45% relative improvement. The paper also introduces EasyOcc, a streamlined model that learns solely from these labels without complex rendering strategies and achieves 13.86 mIoU. When evaluated on the full scene without the camera mask, EasyOcc attains 7.71 mIoU, outperforming the previous best model by 31%.

📝 Abstract
Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address the issue of semantic and depth ambiguity. However, such techniques typically incur high computational costs and memory usage during the training stage, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, which yields substantial performance improvements, with mIoU increasing by 45%, from 9.73 to 14.09, when implemented into the OccNeRF model. This stands in contrast to earlier advancements in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, achieving 13.86 mIoU. This model conducts learning solely from our labels, avoiding complex rendering strategies mentioned previously. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask, with EasyOcc achieving 7.71 mIoU, outperforming the previous best model by 31%. These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
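
The abstract reports mIoU both with and without a camera mask, and quotes a relative gain of 45%. The following is a minimal sketch of how such numbers are typically computed, not the paper's evaluation code. The function names `mean_iou` and `relative_gain`, the mask convention (True means "score this voxel"), and the choice to skip classes absent from both grids are assumptions made for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes, mask=None):
    """Mean IoU over semantic classes for voxel label grids.

    pred, target: integer arrays of identical shape holding class ids.
    mask: optional boolean array of the same shape (assumed convention:
          True = voxel is scored, e.g. visible to a camera).
          mask=None scores the full scene.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        pred, target = pred[keep], target[keep]

    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` (0.45 means 45%)."""
    return (improved - baseline) / baseline
```

With the figures from the abstract, `relative_gain(9.73, 14.09)` is about 0.448, which rounds to the reported 45%. Dropping the mask enlarges the set of scored voxels to include regions no camera observes, which is consistent with the full-scene scores being lower than the masked ones.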
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in self-supervised semantic occupancy prediction
Generating 3D pseudo-labels using foundation models for supervision
Improving performance without complex rendering strategies
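
One way to read "without complex rendering strategies" is that the loss is computed directly on voxels against 3D pseudo-labels, rather than on rendered 2D views. The sketch below shows a masked cross-entropy of that kind. It is an assumption for illustration only: this page does not specify the paper's loss, and the name `masked_voxel_cross_entropy` and the `ignore_index=-1` convention for unlabelled voxels are hypothetical.

```python
import numpy as np

def masked_voxel_cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over labelled voxels only.

    logits: float array of shape (..., num_classes).
    labels: integer array of shape (...), with ignore_index marking
            voxels that carry no pseudo-label (assumed convention).
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    keep = labels != ignore_index
    if not keep.any():
        return 0.0  # nothing to supervise in this grid
    z = logits[keep]                       # (N, num_classes)
    y = labels[keep]                       # (N,)
    z = z - z.max(axis=1, keepdims=True)   # stabilise the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y)), y].mean())
```

Because unlabelled voxels are simply skipped, the density of the pseudo-labels directly controls how much of the grid receives a training signal.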
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates 3D pseudo-labels using foundation models
Uses temporal information for label densification
Simplifies model by avoiding complex rendering strategies
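
The first two points above can be sketched as a generic back-projection pipeline: lift per-pixel metric depth and semantic labels into 3D, move the points from several frames into one reference frame, and vote per voxel. This is a hedged illustration under stated assumptions, not the authors' implementation. It assumes depth maps and per-pixel class ids are already available (for example from Metric3Dv2 and Grounded-SAM, which are not called here), that camera intrinsics `K` and 4x4 camera-to-reference poses are known, and that all function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Lift an HxW metric depth map to camera-frame points, shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns, rows
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def to_reference(points, cam_to_ref):
    """Apply a 4x4 rigid transform to (N, 3) points."""
    return points @ cam_to_ref[:3, :3].T + cam_to_ref[:3, 3]

def voxelize(points, labels, grid_min, voxel_size, grid_shape, num_classes):
    """Majority-vote class id per voxel; -1 marks voxels with no points."""
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]
    votes = np.zeros(tuple(grid_shape) + (num_classes,), dtype=np.int64)
    np.add.at(votes, (idx[:, 0], idx[:, 1], idx[:, 2], labels), 1)
    grid = votes.argmax(axis=-1)
    grid[votes.sum(axis=-1) == 0] = -1
    return grid

def pseudo_label_grid(frames, grid_min, voxel_size, grid_shape, num_classes):
    """Aggregate several frames into one labelled voxel grid.

    frames: iterable of (depth, semantics, K, cam_to_ref) tuples, where
            semantics holds integer class ids with the same HxW shape as depth.
    """
    pts, lbl = [], []
    for depth, sem, K, cam_to_ref in frames:
        valid = (depth > 0).reshape(-1)  # drop pixels without a depth estimate
        pts.append(to_reference(backproject(depth, K), cam_to_ref)[valid])
        lbl.append(sem.reshape(-1)[valid])
    return voxelize(np.concatenate(pts), np.concatenate(lbl),
                    grid_min, voxel_size, grid_shape, num_classes)
```

Passing a single frame yields a sparse grid limited to what that frame sees; passing neighbouring frames with their poses fills additional voxels, which is the intuition behind temporal densification. Handling of moving objects and occlusion is deliberately left out of this sketch.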
Seamie Hayes
Dept. of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Ganesh Sistu
Principal Artificial Intelligence Architect, Valeo Ireland
Autonomous Driving, Machine Learning, Computer Vision, Deep Learning
Ciarán Eising
University of Limerick
computer vision