🤖 AI Summary
To address the severe performance degradation of BEV-based dynamic object motion understanding under low-light and adverse weather conditions, this paper proposes BEVMOSNet, the first end-to-end camera-LiDAR-radar tri-modal BEV motion segmentation framework. It introduces a deformable cross-attention mechanism for explicit inter-sensor feature alignment and fusion, intrinsically incorporates radar-derived velocity measurements, and models spatiotemporal synchronization across the sensors, thereby significantly enhancing robustness and generalization. On the nuScenes dataset, BEVMOSNet achieves a 36.59% IoU improvement over the vision-only baseline BEV-MoSeg and outperforms the extended multimodal baseline SimpleBEV by 2.35%, establishing new state-of-the-art performance in BEV motion segmentation.
📝 Abstract
Accurate motion understanding of the dynamic objects within the scene in bird's-eye-view (BEV) is critical to ensure a reliable obstacle avoidance system and smooth path planning for autonomous vehicles. However, this task has received relatively limited exploration compared to object detection and segmentation, with only a few recent vision-based approaches presenting preliminary findings that significantly deteriorate in low-light, nighttime, and adverse weather conditions such as rain. Conversely, LiDAR and radar sensors remain almost unaffected in these scenarios, and radar additionally provides key velocity information about the objects. Therefore, we introduce BEVMOSNet, to our knowledge the first end-to-end multimodal fusion framework leveraging cameras, LiDAR, and radar to precisely predict the moving objects in BEV. In addition, we perform a deeper analysis to determine the optimal strategy for deformable cross-attention-guided sensor fusion for cross-sensor knowledge sharing in BEV. Evaluating BEVMOSNet on the nuScenes dataset, we show an overall improvement in IoU score of 36.59% compared to the vision-based unimodal baseline BEV-MoSeg (Sigatapu et al., 2023), and of 2.35% compared to the multimodal SimpleBEV (Harley et al., 2022) extended for the motion segmentation task, establishing this method as the state-of-the-art in BEV motion segmentation.
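To make the fusion idea concrete, below is a minimal, illustrative NumPy sketch of deformable cross-attention between two BEV feature maps: each query cell (e.g. from the camera BEV) predicts a small set of sampling offsets and attention weights, then aggregates bilinearly sampled features from another sensor's BEV grid (e.g. LiDAR or radar). This is not the paper's implementation; the projection matrices `W_off` and `W_attn`, the single-head single-level setup, and the per-cell loops are simplifying assumptions for clarity.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a (C, H, W) feature map at continuous coords (x, y),
    clamping out-of-range corners to the grid border."""
    C, H, W = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x0c, x1c = np.clip([x0, x0 + 1], 0, W - 1)
    y0c, y1c = np.clip([y0, y0 + 1], 0, H - 1)
    wx1, wy1 = x - x0, y - y0          # fractional parts
    wx0, wy0 = 1.0 - wx1, 1.0 - wy1
    return (feat[:, y0c, x0c] * wy0 * wx0 + feat[:, y0c, x1c] * wy0 * wx1 +
            feat[:, y1c, x0c] * wy1 * wx0 + feat[:, y1c, x1c] * wy1 * wx1)

def deformable_cross_attn(query_bev, kv_bev, W_off, W_attn, n_points=4):
    """For every BEV cell, predict n_points (dx, dy) offsets and attention
    logits from the query feature, sample kv_bev at the offset locations,
    and return the attention-weighted sum (hypothetical single-head sketch)."""
    C, H, W = query_bev.shape
    out = np.zeros_like(kv_bev)
    for i in range(H):
        for j in range(W):
            q = query_bev[:, i, j]
            offsets = (W_off @ q).reshape(n_points, 2)   # per-point (dx, dy)
            logits = W_attn @ q
            weights = np.exp(logits - logits.max())      # softmax over points
            weights /= weights.sum()
            for k in range(n_points):
                dx, dy = offsets[k]
                out[:, i, j] += weights[k] * bilinear_sample(kv_bev, j + dx, i + dy)
    return out

# Toy usage: fuse a random "camera" BEV query grid with a "LiDAR" BEV grid.
rng = np.random.default_rng(0)
C, H, W, P = 8, 16, 16, 4
cam_bev = rng.normal(size=(C, H, W))
lidar_bev = rng.normal(size=(C, H, W))
fused = deformable_cross_attn(cam_bev, lidar_bev,
                              0.1 * rng.normal(size=(P * 2, C)),
                              rng.normal(size=(P, C)), n_points=P)
```

In the full model each sensor branch would use learned projections and multiple heads, and the sampled offsets let the network compensate for residual misalignment between sensor-specific BEV grids, which is the motivation the abstract gives for attention-guided fusion.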