🤖 AI Summary
This work addresses two challenges in 4D radar perception: the sparse and weak geometric cues in radar data, which hinder effective instance activation, and the limitations of existing radar-camera fusion methods, which suffer from insufficient instance awareness or a lack of global context depending on whether they operate at the bird's-eye-view (BEV) or perspective-view level. To overcome these issues, the authors propose SIFormer, a framework that first suppresses background noise through segmentation- and depth-guided view transformation, then introduces a cross-view instance activation mechanism that propagates 2D instance cues into BEV space, and finally integrates image semantics and radar geometry via a Transformer-based fusion module. By bridging the complementary strengths of BEV and perspective-view fusion through what the authors describe as the first cross-view instance activation in 4D radar-camera perception, SIFormer achieves state-of-the-art performance on View-of-Delft, TJ4DRadSet, and nuScenes, significantly improving 3D object detection accuracy under sparse radar conditions.
📝 Abstract
4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. In this way, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to compensate for the inherently sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet, and nuScenes datasets. Source code is available at github.com/shawnnnkb/SIFormer.
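To make the cross-view activation idea concrete, the toy sketch below shows one plausible reading of the mechanism: 2D camera instances, each with an estimated depth, are projected into a BEV grid, and the radar BEV cells under those footprints have their features boosted so that weak, sparse radar returns on real objects are not suppressed. All names, grid sizes, and the multiplicative boost are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-view instance activation (NOT the SIFormer code):
# camera-detected instances are mapped into BEV cells, where they amplify
# sparse radar features so true objects survive downstream detection.

GRID = 8      # toy BEV resolution: GRID x GRID cells (assumption)
BOOST = 2.0   # multiplicative gain inside instance footprints (assumption)

def project_instance_to_bev(u_center, depth, img_width=64, max_range=16.0):
    """Map an image-column center and a depth estimate to a BEV (row, col) cell."""
    row = min(GRID - 1, int(depth / max_range * GRID))     # forward (range) axis
    col = min(GRID - 1, int(u_center / img_width * GRID))  # lateral axis
    return row, col

def activate_bev(radar_bev, instances):
    """Boost radar BEV cells covered by projected 2D instance cues."""
    out = [row[:] for row in radar_bev]
    for u_center, depth in instances:
        r, c = project_instance_to_bev(u_center, depth)
        out[r][c] *= BOOST
    return out

# Sparse radar BEV map: mostly empty, with a few weak returns.
bev = [[0.0] * GRID for _ in range(GRID)]
bev[3][2] = 0.4   # weak return, e.g. a pedestrian
bev[6][5] = 0.3   # weak return, e.g. a cyclist

# Two 2D camera instances: (image-column center in px, monocular depth in m).
boosted = activate_bev(bev, [(18, 7.0), (44, 13.0)])
print(boosted[3][2], boosted[6][5])  # weak returns amplified where the camera sees instances
```

In the actual model these operations would act on learned feature maps rather than scalar occupancy values, but the sketch captures the core intent: camera semantics decide *where* in BEV space instances live, and radar geometry there is selectively strengthened.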