ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
To address cross-modal misalignment between LiDAR and camera BEV features, which is caused by sensor calibration errors and inaccurate depth estimation, this paper proposes a robust alignment framework based on contrastive learning. The method introduces (1) two modality-specific instance modules, L-Instance for LiDAR and C-Instance for camera, which construct BEV instance features tailored to each sensor's characteristics; and (2) an InstanceFusion module that integrates contrastive alignment with graph matching, jointly optimizing local geometric consistency and global structural correspondence. On the nuScenes validation set, the approach achieves 70.3% mAP, outperforming BEVFusion by 1.8%. Under calibration and depth-noise perturbations, it is also more robust, outperforming BEVFusion by 7.3%. These results indicate improvements in both the accuracy and the stability of multi-modal BEV fusion.

📝 Abstract
In 3D object detection, fusing heterogeneous features from LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is a widely adopted paradigm. However, existing methods are often compromised by imprecise sensor calibration, which causes feature misalignment in LiDAR-camera BEV fusion. Moreover, such inaccuracies lead to depth estimation errors in the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose ContrastAlign, a novel approach that uses contrastive learning to enhance the alignment of heterogeneous modalities, thereby improving the robustness of the fusion process. Specifically, our approach includes the L-Instance module, which directly outputs LiDAR instance features within the LiDAR BEV features. We then introduce the C-Instance module, which predicts camera instance features through RoI (Region of Interest) pooling on the camera BEV features. Finally, we propose the InstanceFusion module, which uses contrastive learning to generate similar instance features across heterogeneous modalities. We then use graph matching to calculate the similarity between the neighboring camera instance features and the similarity instance features, which completes the alignment of instance features. Our method achieves state-of-the-art performance, with an mAP of 70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set. Importantly, our method outperforms BEVFusion by 7.3% under conditions with misalignment noise.
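The contrastive step described in the abstract can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the paper's implementation: the function names `normalize` and `info_nce`, the temperature value, and the toy three-dimensional features are all assumptions made for illustration. It assumes LiDAR instance `i` and camera instance `i` describe the same object, and it penalizes a LiDAR instance whose matching camera instance is not its most similar one.

```python
import math

def normalize(v):
    # Scale a feature vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(lidar_feats, camera_feats, temperature=0.1):
    """One-directional InfoNCE: LiDAR instance i should match camera instance i.

    Hypothetical helper for illustration; not taken from the paper.
    """
    lidar = [normalize(v) for v in lidar_feats]
    camera = [normalize(v) for v in camera_feats]
    loss = 0.0
    for i, anchor in enumerate(lidar):
        # Similarity of this LiDAR instance to every camera instance.
        logits = [sum(a * b for a, b in zip(anchor, cam)) / temperature
                  for cam in camera]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching camera instance as the target.
        loss += log_denominator - logits[i]
    return loss / len(lidar)

# Toy features (assumed): three instances, three channels each.
lidar = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotate the camera instances by one position to mimic misalignment.
camera_shifted = camera_aligned[1:] + camera_aligned[:1]

loss_aligned = info_nce(lidar, camera_aligned)
loss_shifted = info_nce(lidar, camera_shifted)
```

In this toy setup the loss is lower for the aligned pairing than for the shifted one, which is the signal a contrastive objective uses to pull matching instance features together.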
Problem

Research questions and friction points this paper is trying to address.

Addressing feature misalignment in LiDAR-camera BEV fusion
Improving robustness to inaccurate sensor calibration
Enhancing cross-modal feature alignment through contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning for robust BEV feature alignment
L-Instance and C-Instance modules extract instance features from the LiDAR and camera BEV features
InstanceFusion uses graph matching to compute instance similarity and complete the alignment
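To make the idea of matching instances across modalities concrete, the sketch below pairs LiDAR and camera instance features by cosine similarity using a greedy one-to-one assignment. This is a deliberately simpler stand-in for the graph matching named above, not the paper's procedure: it ignores the neighboring camera instances that the abstract mentions, and the names `cosine`, `greedy_match`, and `min_similarity` are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match(lidar_feats, camera_feats, min_similarity=0.5):
    """Pair each LiDAR instance with at most one camera instance, best pairs first.

    Simplified substitute for graph matching; the threshold is an assumed value.
    """
    # Score every LiDAR-camera pair and visit them from most to least similar.
    scored = sorted(
        ((cosine(l, c), i, j)
         for i, l in enumerate(lidar_feats)
         for j, c in enumerate(camera_feats)),
        reverse=True,
    )
    used_lidar, used_camera, pairs = set(), set(), {}
    for score, i, j in scored:
        if score < min_similarity:
            break  # remaining pairs are too dissimilar to count as matches
        if i not in used_lidar and j not in used_camera:
            pairs[i] = j
            used_lidar.add(i)
            used_camera.add(j)
    return pairs

# Toy features (assumed): two LiDAR instances, three camera instances.
lidar = [[1.0, 0.0], [0.0, 1.0]]
camera = [[0.1, 0.9], [0.8, 0.2], [-1.0, -1.0]]
matches = greedy_match(lidar, camera)  # maps LiDAR index -> camera index
```

Camera instances that fall below the threshold, such as the third one here, stay unmatched. A full graph-matching formulation would additionally score how well the neighborhoods of two instances agree, not only the instances themselves.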
Ziying Song
Beijing Jiaotong University
Object Detection, Computer Vision, Deep Learning
Feiyang Jia
Beijing Jiaotong University
Hongyu Pan
Alibaba DAMO Academy, Autonomous Driving Lab
Computer Vision, Detection, Segmentation, Point Cloud, Motion, End2End
Yadan Luo
ARC DECRA and Senior Lecturer, University of Queensland
Generalization, 3D Vision, Autonomous Driving
Caiyan Jia
School of Computer Science and Technology, Beijing Jiaotong University, China; Beijing Key Lab of Traffic Data Analysis and Mining, China
Guoxin Zhang
School of Computer Science, Beijing University of Posts and Telecommunications
Computer Vision, Pattern Recognition
Lin Liu
School of Computer Science and Technology, Beijing Jiaotong University, China; Beijing Key Lab of Traffic Data Analysis and Mining, China
Yang Ji
Horizon Robotics
Lei Yang
Tsinghua University, China
Li Wang
Beijing Institute of Technology, China