🤖 AI Summary
To address the challenges of infrastructure-side multi-camera 3D object detection, including multi-view geometric heterogeneity, diverse camera configurations, degraded visual quality, and complex road layouts, this paper proposes a Transformer-based bird's-eye-view (BEV) perception framework. The framework supports flexible integration of heterogeneous cameras and introduces a graph-enhanced fusion module that explicitly models geometric relationships between cameras and BEV grid cells while jointly aggregating implicit visual features, yielding relation-aware multi-view feature fusion; it builds on deformable attention, graph neural networks, and multimodal fusion. Extensive experiments on the synthetic dataset M2I and the real-world dataset RoScenes demonstrate state-of-the-art performance on both benchmarks. Notably, the method maintains high accuracy under adverse conditions such as extreme weather and sensor degradation, exhibiting strong potential for deployment in practical intelligent transportation systems.
📝 Abstract
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setups, diverse camera configurations, degraded visual inputs, and varied road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
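The abstract describes fusing multi-view image features into BEV space by exploiting camera-to-BEV-cell geometric relationships, but does not spell out the mechanism. As a rough illustration of the general idea, not the paper's actual implementation, the NumPy sketch below projects BEV cell centers into each camera with its own intrinsics/extrinsics and weights each camera's contribution by a simple geometric score (visibility and inverse distance). All function names, the scoring heuristic, and the use of one global feature vector per camera are hypothetical simplifications; a real model would sample dense image features and learn the relation weights.

```python
import numpy as np

def project(points_world, K, T_world_to_cam):
    """Project Nx3 world points into pixel coordinates; returns (uv, depth)."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]          # camera-frame coords
    depth = cam[:, 2]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)   # perspective divide
    return uv, depth

def fuse_bev(cell_centers, cams, img_hw=(480, 640)):
    """Fuse per-camera features into BEV cells with geometry-based weights.

    cams: list of dicts with intrinsics "K" (3x3), world-to-camera pose
    "T" (4x4), and a per-camera feature vector "feat" (hypothetical
    stand-in for sampled image features).
    """
    n_cells = len(cell_centers)
    feat_dim = cams[0]["feat"].shape[-1]
    fused = np.zeros((n_cells, feat_dim))
    weights = np.zeros((n_cells, len(cams)))
    h, w = img_hw
    for j, cam in enumerate(cams):
        uv, depth = project(cell_centers, cam["K"], cam["T"])
        visible = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                  & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        # Geometric relation score: visible cells score higher when closer.
        weights[:, j] = visible / (1.0 + depth.clip(min=0.0))
    # Normalize scores across cameras so each cell's weights sum to 1
    # (cells seen by no camera keep zero weight and a zero fused feature).
    norm = np.clip(weights.sum(axis=1, keepdims=True), 1e-6, None)
    w_norm = weights / norm
    for j, cam in enumerate(cams):
        fused += w_norm[:, j:j + 1] * cam["feat"]
    return fused, w_norm
```

Because each camera carries its own `K` and `T`, the same routine handles a variable number of cameras with heterogeneous intrinsics and extrinsics, which is the flexibility the framework emphasizes; the graph-enhanced module in the paper replaces the fixed inverse-distance heuristic with learned, relation-aware aggregation.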