🤖 AI Summary
To address the heavy reliance of BEV semantic segmentation on large-scale manual annotations, this paper proposes the first self-supervised training framework based on differentiable volumetric rendering. The method projects predictions from a pre-trained 2D semantic segmentation model into the BEV space via differentiable volumetric rendering, generating dense, geometrically consistent pseudo-labels and enabling BEV self-supervised pre-training without any BEV annotations. The framework integrates BEV feature-space modeling, knowledge distillation, and joint optimization. Experiments demonstrate: (1) competitive zero-shot BEV segmentation performance; (2) over a 15 percentage-point mIoU improvement when fine-tuning with only 1% of the labeled data; and (3) new state-of-the-art results on major benchmarks (e.g., nuScenes) under full supervision. This work establishes a novel paradigm for low-resource BEV understanding.
📝 Abstract
Bird's Eye View (BEV) semantic maps have recently garnered a lot of attention as a useful representation of the environment to tackle assisted and autonomous driving tasks. However, most of the existing work focuses on the fully supervised setting, training networks on large annotated datasets. In this work, we present RendBEV, a new method for the self-supervised training of BEV semantic segmentation networks, leveraging differentiable volumetric rendering to receive supervision from semantic perspective views computed by a 2D semantic segmentation model. Our method enables zero-shot BEV semantic segmentation, and already delivers competitive results in this challenging setting. When used as pretraining to then fine-tune on labeled BEV ground-truth, our method significantly boosts performance in low-annotation regimes, and sets a new state of the art when fine-tuning on all available labels.
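The supervision signal hinges on differentiable volumetric rendering: per-sample densities and semantic probabilities along a camera ray are alpha-composited into a rendered semantic prediction for each pixel, which can then be compared against the 2D model's output. Below is a minimal NumPy sketch of that compositing step; the function name, array shapes, and use of NumPy are illustrative assumptions (the paper's implementation would be autograd-based, e.g., in PyTorch), not the authors' actual code.

```python
import numpy as np

def render_semantics_along_ray(sigmas, semantics, deltas):
    """Alpha-composite per-sample semantics along one ray (illustrative sketch).

    sigmas:    (N,) non-negative volume densities at N samples along the ray
    semantics: (N, C) per-sample class probabilities
    deltas:    (N,) distances between consecutive samples
    Returns:   (C,) rendered class probabilities for the corresponding pixel
    """
    # Opacity contributed by each sample (standard volumetric rendering).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    # Rendering weights, then a weighted sum of per-sample semantics.
    weights = trans * alphas
    return weights @ semantics

# A nearly opaque sample dominates the rendered prediction:
sigmas = np.array([0.0, 100.0, 0.0])
deltas = np.ones(3)
semantics = np.array([[1.0, 0.0],   # class 0
                      [0.0, 1.0],   # class 1 (opaque sample)
                      [1.0, 0.0]])
rendered = render_semantics_along_ray(sigmas, semantics, deltas)
```

Because every operation is differentiable, gradients from a loss on the rendered semantics can flow back into the BEV network's density and semantic predictions, which is what makes this usable as a self-supervised training signal.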