🤖 AI Summary
This work addresses a limitation of existing multimodal masked-autoencoder pretraining for 3D bird’s-eye-view (BEV) object detection: uniform random masking treats all camera and LiDAR regions equally, and representations are learned only through masked reconstruction. To overcome this, the authors propose a Semantic-Guided Multimodal Masked Autoencoder (SG-M2AE) that introduces semantic information during pretraining through two separate components. The first is a semantics-guided LiDAR voxel masking strategy that preferentially preserves semantically important regions. The second is an auxiliary point-wise LiDAR semantic decoder branch that adds semantic supervision alongside reconstruction. With BEVFusion 3D object detection on the nuScenes mini validation set, each component improves on the UniM2AE baseline: semantics-guided voxel masking yields +1.49% mAP and +1.66% NDS, while point-wise semantic supervision yields +1.39% mAP and +3.22% NDS.
📝 Abstract
Accurate 3D bird's-eye view (BEV) object detection is essential for autonomous driving, and depends strongly on effective multimodal representations from complementary sensors such as cameras and LiDAR. Multimodal masked autoencoders have shown strong potential for learning such representations for downstream 3D BEV object detection. However, existing methods typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction. We propose a semantics-guided multimodal masked autoencoder framework that introduces semantic information during pretraining through two separate components: (i) semantics-guided LiDAR voxel masking, which preserves semantically important LiDAR regions more strongly, and (ii) an auxiliary point-wise LiDAR semantic decoder branch that injects semantic guidance in addition to reconstruction. On BEVFusion 3D object detection, our semantics-guided pretraining strategy improves performance on the nuScenes mini validation set compared to the standard UniM2AE baseline: semantics-guided LiDAR voxel masking yields +1.49% mean Average Precision (mAP) and +1.66% nuScenes Detection Score (NDS), while decoder-side point semantic supervision yields +1.39% mAP and +3.22% NDS over the baseline.
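The abstract does not spell out how semantics-guided voxel masking is computed, so the following is only a minimal sketch of one way such a strategy could work, not the authors' implementation. It assumes each LiDAR voxel already carries a hypothetical semantic importance score in [0, 1] (for example, derived from point-wise class labels), and it masks a fixed fraction of voxels by weighted sampling without replacement, so that voxels with higher scores are less likely to be masked. The function name, the `temperature` parameter, and the exponential weighting are all illustrative assumptions.

```python
import math
import random


def semantic_guided_mask(importance, mask_ratio=0.7, temperature=1.0, seed=None):
    """Return a list of booleans, True where a voxel is masked.

    importance:  per-voxel scores, assumed to lie in [0, 1]; higher means
                 more semantically important (hypothetical input).
    mask_ratio:  fraction of voxels to mask, as in uniform random masking.
    temperature: smaller values protect important voxels more strongly;
                 very large values approach uniform random masking.
    """
    rng = random.Random(seed)
    n = len(importance)
    num_mask = round(mask_ratio * n)

    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # each voxel gets key log(u) / w, and the largest keys are selected.
    # The masking weight w shrinks as the importance score grows.
    keys = []
    for score in importance:
        weight = math.exp(-score / temperature)
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        keys.append(math.log(u) / weight)

    order = sorted(range(n), key=lambda i: keys[i], reverse=True)
    masked = set(order[:num_mask])
    return [i in masked for i in range(n)]


# Example: 1000 background voxels (score 0) and 1000 object voxels (score 1).
scores = [0.0] * 1000 + [1.0] * 1000
mask = semantic_guided_mask(scores, mask_ratio=0.5, temperature=0.2, seed=0)
print("masked background voxels:", sum(mask[:1000]))
print("masked object voxels:", sum(mask[1000:]))
```

The overall mask ratio stays fixed, which keeps the reconstruction task comparable to uniform random masking; only the choice of which voxels are hidden changes. A deterministic rule that always keeps the top-scoring voxels would be another option, but sampling retains some randomness across training iterations.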