Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficiency of single-modality 3D object detection in autonomous driving by proposing MMF-BEV, a multimodal fusion framework. MMF-BEV enhances radar and camera features individually through deformable self-attention, and aligns and fuses sparse radar points with dense visual features in bird's-eye-view (BEV) space via deformable cross-attention. The method employs a two-stage training strategy and depth supervision to improve stability, and introduces an interpretable sensor contribution analysis mechanism to quantify modality weights at different distances. Experiments on the View-of-Delft dataset demonstrate that the proposed approach significantly outperforms single-modality baselines across both the full evaluation region and near-range regions of interest, achieving state-of-the-art or highly competitive detection performance.
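The summary mentions an interpretable sensor contribution analysis that quantifies modality weights at different distances. The paper's exact mechanism is not given here; the following is a minimal sketch of the general idea, assuming softmax-normalized gating logits per distance bin (all names and numbers are hypothetical):

```python
# Hypothetical sketch: per-distance-bin modality contribution weights.
# Gating logits and bin values below are illustrative assumptions,
# NOT the paper's actual learned parameters.
import math

def modality_weights(camera_logit: float, radar_logit: float) -> tuple[float, float]:
    """Softmax over two modality gating logits -> normalized contribution weights."""
    m = max(camera_logit, radar_logit)  # subtract max for numerical stability
    ec = math.exp(camera_logit - m)
    er = math.exp(radar_logit - m)
    s = ec + er
    return ec / s, er / s

# Illustrative gating logits per range bin (metres):
bins = {
    "0-20m":  (2.0, 0.5),   # camera semantics dominate near range
    "20-40m": (1.0, 1.0),   # balanced contribution
    "40-60m": (0.3, 1.8),   # radar range accuracy dominates far range
}
for rng, (cl, rl) in bins.items():
    wc, wr = modality_weights(cl, rl)
    print(f"{rng}: camera={wc:.2f} radar={wr:.2f}")
```

Plotting such weights against distance is one way to obtain the interpretable per-distance evidence of sensor complementarity the summary describes.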
📝 Abstract
Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy (pre-training the camera branch with depth supervision, then jointly training the radar and fusion modules) stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.
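The core fusion step the abstract describes is deformable cross-attention between camera BEV queries and radar BEV features. As a rough illustration of that mechanism (not the paper's implementation), the sketch below has each camera BEV cell attend to a few bilinearly sampled radar locations at learned offsets around its own position; all shapes, names, and the residual fusion choice are assumptions:

```python
# Minimal numpy sketch of deformable cross-attention in BEV space.
# Shapes, names, and the residual connection are illustrative assumptions,
# not the authors' code.
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, C) at continuous (x, y), clamped to bounds."""
    H, W, _ = feat.shape
    x = np.clip(x, 0, W - 1); y = np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x1]
            + (1 - dx) * dy * feat[y1, x0] + dx * dy * feat[y1, x1])

def deformable_cross_attn(query_bev, radar_bev, offsets, attn_logits):
    """Each camera BEV query attends to K sampled radar locations near its cell.

    query_bev:   (H, W, C) camera BEV features (queries)
    radar_bev:   (H, W, C) radar BEV features (values)
    offsets:     (H, W, K, 2) learned (dx, dy) sampling offsets per query
    attn_logits: (H, W, K) per-sample attention logits
    """
    H, W, C = query_bev.shape
    K = offsets.shape[2]
    out = np.empty_like(query_bev)
    # Softmax over the K sampling points of each query.
    a = np.exp(attn_logits - attn_logits.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    for i in range(H):
        for j in range(W):
            sampled = np.stack([
                bilinear_sample(radar_bev,
                                j + offsets[i, j, k, 0],
                                i + offsets[i, j, k, 1])
                for k in range(K)])          # (K, C) sampled radar features
            out[i, j] = a[i, j] @ sampled    # attention-weighted aggregation
    # Residual fusion: camera queries plus aggregated radar evidence.
    return query_bev + out

rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 4
fused = deformable_cross_attn(rng.normal(size=(H, W, C)),
                              rng.normal(size=(H, W, C)),
                              rng.normal(scale=1.5, size=(H, W, K, 2)),
                              rng.normal(size=(H, W, K)))
print(fused.shape)  # (8, 8, 16)
```

Sampling at sparse, learned offsets rather than attending densely is what makes this tractable for aligning sparse radar points with dense visual features in BEV space.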
Problem

Research questions and friction points this paper is trying to address.

multi-modal sensor fusion
3D object detection
autonomous driving
camera-radar fusion
BEV representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deformable Attention
BEV Fusion
Radar-Camera Fusion
Multi-Modal Sensor Fusion
Autonomous Driving
Mayank Mayank
Research and Development, Mercedes-Benz AG, Germany
Bharanidhar Duraisamy
Research and Development, Mercedes-Benz AG, Germany
Florian Geiß
Research and Development, Mercedes-Benz AG, Germany
Abhinav Valada
Professor & Director of Robot Learning Lab, University of Freiburg
Robotics, Machine Learning, Computer Vision, Artificial Intelligence