🤖 AI Summary
To address the challenges of infrastructure-side multi-camera 3D object detection, including multi-view geometric heterogeneity, diverse camera configurations, degraded visual quality, and complex road layouts, this paper proposes a Transformer-based bird's-eye-view (BEV) perception framework. The framework supports flexible integration of heterogeneous cameras and introduces a graph-enhanced fusion module that explicitly models geometric relationships between cameras and BEV grid cells while jointly aggregating implicit visual features, yielding relation-aware multi-view feature fusion; it builds on deformable attention, graph neural networks, and multimodal fusion. Extensive experiments on the synthetic dataset M2I and the real-world dataset RoScenes demonstrate state-of-the-art performance on both benchmarks. Notably, the method maintains high accuracy under adverse conditions such as extreme weather and sensor degradation, exhibiting strong potential for deployment in practical intelligent transportation systems.
📝 Abstract
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setups, diverse camera configurations, degraded visual inputs, and varied road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
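The abstract describes fusing multi-view image features into BEV space by exploiting camera-to-BEV-cell geometric relationships, but does not spell out the mechanism. As a rough illustration of the general idea, not the paper's actual implementation, the NumPy sketch below projects BEV cell centers into each camera with its own intrinsics/extrinsics and weights each camera's contribution by a simple geometric score (visibility and inverse distance). All function names, the scoring heuristic, and the use of one global feature vector per camera are hypothetical simplifications; a real model would sample dense image features and learn the relation weights.

```python
import numpy as np

def project(points_world, K, T_world_to_cam):
    """Project Nx3 world points into pixel coordinates; returns (uv, depth)."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]          # camera-frame coords
    depth = cam[:, 2]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)   # perspective divide
    return uv, depth

def fuse_bev(cell_centers, cams, img_hw=(480, 640)):
    """Fuse per-camera features into BEV cells with geometry-based weights.

    cams: list of dicts with intrinsics "K" (3x3), world-to-camera pose
    "T" (4x4), and a per-camera feature vector "feat" (hypothetical
    stand-in for sampled image features).
    """
    n_cells = len(cell_centers)
    feat_dim = cams[0]["feat"].shape[-1]
    fused = np.zeros((n_cells, feat_dim))
    weights = np.zeros((n_cells, len(cams)))
    h, w = img_hw
    for j, cam in enumerate(cams):
        uv, depth = project(cell_centers, cam["K"], cam["T"])
        visible = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                  & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        # Geometric relation score: visible cells score higher when closer.
        weights[:, j] = visible / (1.0 + depth.clip(min=0.0))
    # Normalize scores across cameras so each cell's weights sum to 1
    # (cells seen by no camera keep zero weight and a zero fused feature).
    norm = np.clip(weights.sum(axis=1, keepdims=True), 1e-6, None)
    w_norm = weights / norm
    for j, cam in enumerate(cams):
        fused += w_norm[:, j:j + 1] * cam["feat"]
    return fused, w_norm
```

Because each camera carries its own `K` and `T`, the same routine handles a variable number of cameras with heterogeneous intrinsics and extrinsics, which is the flexibility the framework emphasizes; the graph-enhanced module in the paper replaces the fixed inverse-distance heuristic with learned, relation-aware aggregation.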