MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing camera–4D millimeter-wave radar fusion methods—largely adapted from LiDAR-centric explicit bird’s-eye view (BEV) paradigms—overlook the inherent geometric limitations of sparse and noisy 4D radar point clouds. To address this, we propose a three-tier collaborative fusion framework: point-level, scene-level, and proposal-level. Our method introduces a novel Triple-Attention voxel encoder, hierarchical scene fusion pooling, and a proposal-level feature enhancement module, synergistically integrating deformable attention and multi-scale feature fusion to achieve fine-grained cross-modal feature alignment and complementarity. Evaluated on View-of-Delft and TJ4DRadSet, our approach achieves state-of-the-art (SOTA) performance. Notably, on the VoD dataset, its 3D object detection accuracy matches that of LiDAR-based baselines, while significantly improving robustness and precision for 4D radar–only 3D detection.

📝 Abstract
The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird's-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar's inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point level, ERPE densifies radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context, and applies pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates them with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.
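The abstract's scene-level fusion (HSFP) relies on deformable attention: each query samples the 2D image feature map at a few learned offset locations and aggregates the samples with softmax weights. The paper's exact formulation is not given here, so the following is only a minimal single-query NumPy sketch of that sampling pattern; all function names, shapes, and the choice of bilinear interpolation are assumptions, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a (H, W, C) feature map at continuous coords (x, y)."""
    H, W, _ = feat.shape
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def deformable_attend(query_xy, offsets, logits, image_feat):
    """Aggregate image features sampled at query + learned offsets.

    query_xy:   (x, y) reference point of one query on the feature map
    offsets:    (K, 2) learned sampling offsets (hypothetical values here)
    logits:     (K,) unnormalized attention scores, softmax-normalized below
    image_feat: (H, W, C) 2D image feature map
    """
    w = np.exp(logits - logits.max())
    w /= w.sum()  # softmax attention weights over the K sampling points
    samples = np.stack([bilinear_sample(image_feat, query_xy[0] + dx, query_xy[1] + dy)
                        for dx, dy in offsets])
    return (w[:, None] * samples).sum(axis=0)  # (C,) fused feature for this query
```

Because the weights are a convex combination, sampling a constant feature map returns that constant, which makes the aggregation easy to sanity-check.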
Problem

Research questions and friction points this paper is trying to address.

Addressing sparse and noisy point clouds from 4D radar in autonomous driving
Overcoming limitations of existing radar-camera fusion methods for 3D detection
Improving 3D object detection performance to match LiDAR-based approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level fusion of 4D radar and camera data
Enhanced Radar Point Encoder densifies sparse point clouds
Hierarchical scene and proposal-level fusion for refinement
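The first step of the densification idea above requires associating radar points with 2D image instances, which in turn requires projecting 3D points into the image plane. The paper's ERPE internals are not described on this page, so the snippet below is only a generic sketch of that association step under a standard pinhole camera model; the function names, the axis-aligned box representation, and the intrinsics matrix `K` are illustrative assumptions.

```python
import numpy as np

def project_to_image(points_cam, K):
    """Project 3D points in the camera frame (N, 3) to pixel coords (N, 2)
    using pinhole intrinsics K (3x3)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

def points_in_instance(points_cam, K, box):
    """Boolean mask of radar points whose projection falls inside a 2D
    instance box (x1, y1, x2, y2) — a stand-in for an instance mask."""
    uv = project_to_image(points_cam, K)
    x1, y1, x2, y2 = box
    return (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
```

Points selected this way could then be tagged with instance identity or image features before voxelization; the actual densification strategy in ERPE may differ.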
Yuzhi Wu
MoE Key Laboratory of Brain-Inspired Intelligence Perception and Cognition, University of Science and Technology of China, Hefei 230052, China
Li Xiao
MoE Key Laboratory of Brain-Inspired Intelligence Perception and Cognition, University of Science and Technology of China, Hefei 230052, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China
Jun Liu
Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China
Guangfeng Jiang
Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China
Xiang-Gen Xia
Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716, USA
signal processing · digital communications · radar signal processing