🤖 AI Summary
To address the scarcity of 2D semantic and instance annotations under rare viewpoints in autonomous driving, which severely limits the generalization of perception models, this paper proposes a geometry-semantics co-driven panoramic label generation framework. The method jointly optimizes a 3D semantic field by fusing coarse 3D geometric priors with noisy 2D semantic cues, so that 3D geometry and 2D semantics mutually enhance each other, while a semantic-field denoising mechanism suppresses 2D annotation noise. Leveraging NeRF-based rendering, hybrid features from hash encoding and MLPs, and pseudo-ground-truth-guided supervision, the approach synthesizes high-fidelity, omnidirectional, multi-view, and spatiotemporally consistent panoramic labels, including appearance, semantic, and instance masks. Evaluated on KITTI-360, the method achieves state-of-the-art performance in cross-view label transfer and markedly improves generalization to unseen viewpoints.
📝 Abstract
Training perception systems for self-driving cars requires substantial 2D annotations that are labor-intensive to label manually. While existing datasets provide rich annotations on pre-recorded sequences, they fall short in labeling rarely encountered viewpoints, potentially hampering the generalization ability of perception models. In this paper, we present PanopticNeRF-360, a novel approach that combines coarse 3D annotations with noisy 2D semantic cues to generate high-quality panoptic labels and images from any viewpoint. Our key insight lies in exploiting the complementarity of 3D and 2D priors to mutually enhance geometry and semantics. Specifically, we propose to leverage coarse 3D bounding primitives and noisy 2D semantic and instance predictions to guide geometry optimization, by encouraging predicted labels to match panoptic pseudo ground truth. Simultaneously, the improved geometry assists in filtering 3D and 2D annotation noise by fusing semantics in 3D space via a learned semantic field. To further enhance appearance, we combine MLPs and hash grids to yield hybrid scene features, striking a balance between high-frequency appearance and contiguous semantics. Our experiments demonstrate PanopticNeRF-360's state-of-the-art performance over label transfer methods on the challenging urban scenes of the KITTI-360 dataset. Moreover, PanopticNeRF-360 enables omnidirectional rendering of high-fidelity, multi-view and spatiotemporally consistent appearance, semantic and instance labels. We make our code and data available at https://github.com/fuxiao0719/PanopticNeRF
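The abstract describes rendering semantic labels from a learned semantic field via NeRF-style volume compositing. A minimal NumPy sketch of that compositing step is below; the function names and the per-sample `logits` head are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def ray_weights(sigma, delta):
    """Standard NeRF volume-rendering weights along one ray.
    sigma: (N,) densities at N samples; delta: (N,) segment lengths."""
    alpha = 1.0 - np.exp(-sigma * delta)  # per-sample opacity
    # Transmittance T_i = product of (1 - alpha_j) for j < i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha  # w_i = T_i * alpha_i

def render_semantics(sigma, delta, logits):
    """Composite per-sample class logits into a single pixel-level
    semantic distribution, mirroring how a semantic field is rendered.
    logits: (N, C) per-sample class scores (hypothetical MLP head)."""
    w = ray_weights(sigma, delta)  # (N,)
    # Per-sample softmax over C classes
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return w @ probs  # (C,) expected class distribution for the pixel
```

Under this formulation the rendered label inherits the geometry's multi-view consistency: a sample with high density dominates the weights, so noisy 2D predictions at other depths are downweighted, which is the intuition behind letting improved geometry filter annotation noise.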