Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficiency of single-modality 3D object detection in autonomous driving by proposing MMF-BEV, a multimodal fusion framework. MMF-BEV enhances radar and camera features individually through deformable self-attention, and aligns and fuses sparse radar points with dense visual features in bird's-eye-view (BEV) space via deformable cross-attention. The method employs a two-stage training strategy and depth supervision to improve stability, and introduces an interpretable sensor contribution analysis mechanism to quantify modality weights at different distances. Experiments on the View-of-Delft dataset demonstrate that the proposed approach significantly outperforms single-modality baselines across both the full evaluation region and near-range regions of interest, achieving state-of-the-art or highly competitive detection performance.
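The summary mentions an interpretable sensor contribution analysis that quantifies modality weights at different distances. The paper's exact mechanism is not given here; the following is a minimal sketch of the general idea, assuming softmax-normalized gating logits per distance bin (all names and numbers are hypothetical):

```python
# Hypothetical sketch: per-distance-bin modality contribution weights.
# Gating logits and bin values below are illustrative assumptions,
# NOT the paper's actual learned parameters.
import math

def modality_weights(camera_logit: float, radar_logit: float) -> tuple[float, float]:
    """Softmax over two modality gating logits -> normalized contribution weights."""
    m = max(camera_logit, radar_logit)  # subtract max for numerical stability
    ec = math.exp(camera_logit - m)
    er = math.exp(radar_logit - m)
    s = ec + er
    return ec / s, er / s

# Illustrative gating logits per range bin (metres):
bins = {
    "0-20m":  (2.0, 0.5),   # camera semantics dominate near range
    "20-40m": (1.0, 1.0),   # balanced contribution
    "40-60m": (0.3, 1.8),   # radar range accuracy dominates far range
}
for rng, (cl, rl) in bins.items():
    wc, wr = modality_weights(cl, rl)
    print(f"{rng}: camera={wc:.2f} radar={wr:.2f}")
```

Plotting such weights against distance is one way to obtain the interpretable per-distance evidence of sensor complementarity the summary describes.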
📝 Abstract
Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy (pre-training the camera branch with depth supervision, then jointly training the radar and fusion modules) stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.
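The core fusion step the abstract describes is deformable cross-attention between camera BEV queries and radar BEV features. As a rough illustration of that mechanism (not the paper's implementation), the sketch below has each camera BEV cell attend to a few bilinearly sampled radar locations at learned offsets around its own position; all shapes, names, and the residual fusion choice are assumptions:

```python
# Minimal numpy sketch of deformable cross-attention in BEV space.
# Shapes, names, and the residual connection are illustrative assumptions,
# not the authors' code.
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, C) at continuous (x, y), clamped to bounds."""
    H, W, _ = feat.shape
    x = np.clip(x, 0, W - 1); y = np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x1]
            + (1 - dx) * dy * feat[y1, x0] + dx * dy * feat[y1, x1])

def deformable_cross_attn(query_bev, radar_bev, offsets, attn_logits):
    """Each camera BEV query attends to K sampled radar locations near its cell.

    query_bev:   (H, W, C) camera BEV features (queries)
    radar_bev:   (H, W, C) radar BEV features (values)
    offsets:     (H, W, K, 2) learned (dx, dy) sampling offsets per query
    attn_logits: (H, W, K) per-sample attention logits
    """
    H, W, C = query_bev.shape
    K = offsets.shape[2]
    out = np.empty_like(query_bev)
    # Softmax over the K sampling points of each query.
    a = np.exp(attn_logits - attn_logits.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    for i in range(H):
        for j in range(W):
            sampled = np.stack([
                bilinear_sample(radar_bev,
                                j + offsets[i, j, k, 0],
                                i + offsets[i, j, k, 1])
                for k in range(K)])          # (K, C) sampled radar features
            out[i, j] = a[i, j] @ sampled    # attention-weighted aggregation
    # Residual fusion: camera queries plus aggregated radar evidence.
    return query_bev + out

rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 4
fused = deformable_cross_attn(rng.normal(size=(H, W, C)),
                              rng.normal(size=(H, W, C)),
                              rng.normal(scale=1.5, size=(H, W, K, 2)),
                              rng.normal(size=(H, W, K)))
print(fused.shape)  # (8, 8, 16)
```

Sampling at sparse, learned offsets rather than attending densely is what makes this tractable for aligning sparse radar points with dense visual features in BEV space.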
Problem

Research questions and friction points this paper is trying to address.

multi-modal sensor fusion
3D object detection
autonomous driving
camera-radar fusion
BEV representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deformable Attention
BEV Fusion
Radar-Camera Fusion
Multi-Modal Sensor Fusion
Autonomous Driving
Mayank Mayank
Research and Development, Mercedes-Benz AG, Germany
Bharanidhar Duraisamy
Research and Development, Mercedes-Benz AG, Germany
Florian Geiß
Research and Development, Mercedes-Benz AG, Germany
Abhinav Valada
Professor & Director of Robot Learning Lab, University of Freiburg
Robotics, Machine Learning, Computer Vision, Artificial Intelligence