🤖 AI Summary
To address the significant degradation in multimodal fusion performance—and consequently, the poor robustness of 3D object detection—under adverse weather conditions (e.g., dense fog, heavy snow, sensor soiling), this paper proposes a cross-modal fusion framework tailored for extreme environments. The method integrates RGB, LiDAR, near-infrared (NIR) gated imaging, and radar data, incorporating four key innovations: depth-guided feature alignment, attention-driven adaptive weighted fusion, bird's-eye-view (BEV) feature refinement, and a Transformer-based decoder. Crucially, it introduces dynamic modality weighting conditioned on both distance and visibility estimates, substantially enhancing feature complementarity and discriminability under low-visibility conditions. Evaluated on long-range pedestrian detection in dense-fog benchmarks, the framework achieves a 17.2 AP absolute improvement in average precision over the next-best method, effectively bridging the performance gap between ideal laboratory settings and real-world edge cases.
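The core idea of distance- and visibility-conditioned modality weighting can be illustrated with a toy sketch. Note that this is a hypothetical simplification for intuition only, not the paper's learned attention mechanism: the per-modality priors (`base`, `range_decay`, `fog_robustness`) and the softmax-over-reliability formulation below are illustrative assumptions, whereas the actual method learns these weightings end-to-end in the Transformer decoder.

```python
import math

def modality_weights(distance_m, visibility_m, priors):
    """Toy sketch: score each modality's reliability given target distance
    and scene visibility, then normalize the scores with a softmax.
    priors maps modality -> (base, range_decay, fog_robustness in [0, 1])."""
    logits = {}
    for name, (base, range_decay, fog_robust) in priors.items():
        # Modalities with high range_decay lose confidence at distance;
        # high fog_robustness dampens the penalty from poor visibility.
        vis_penalty = (1.0 - fog_robust) * max(0.0, 1.0 - visibility_m / 1000.0)
        logits[name] = base - range_decay * distance_m / 100.0 - vis_penalty
    # Numerically stable softmax over the reliability logits.
    z = max(logits.values())
    exp = {k: math.exp(v - z) for k, v in logits.items()}
    total = sum(exp.values())
    return {k: v / total for k, v in exp.items()}

# Illustrative (assumed) priors: RGB degrades fastest in fog,
# radar is weather-robust but spatially coarse.
priors = {
    "rgb":   (1.0, 0.5, 0.1),
    "lidar": (1.2, 0.8, 0.3),
    "gated": (0.9, 0.4, 0.8),
    "radar": (0.6, 0.1, 0.95),
}

clear_near = modality_weights(distance_m=20, visibility_m=1000, priors=priors)
foggy_far = modality_weights(distance_m=120, visibility_m=80, priors=priors)
```

Under this sketch, weight shifts from RGB/LiDAR toward gated NIR and radar as distance grows and visibility drops, which mirrors the qualitative behavior the paper reports for its learned weighting.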
📝 Abstract
Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, these approaches fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. We introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to fusing RGB and LiDAR sensors, which are employed in recent autonomous driving literature, our sensor fusion stack is also capable of learning from NIR gated camera and radar modalities to tackle low light and inclement weather. We fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement on the Bird's Eye View (BEV) plane to combine image and range features effectively. Our detections are predicted by a transformer decoder that weighs modalities based on distance and visibility. We demonstrate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases. Our approach improves average precision by 17.2 AP over the next-best method for vulnerable pedestrians at long distances and in challenging foggy scenes. Our project page is available at https://light.princeton.edu/samfusion/