🤖 AI Summary
This work addresses key challenges in multimodal 3D object detection—inefficient context modeling, spatially inflexible fusion strategies, and insufficient reasoning under uncertainty—by proposing MambaFusion, a novel framework that integrates selective state space models (SSMs) with windowed Transformers. This combination enables global context propagation and local geometric preservation at linear computational complexity. The method further enhances physical plausibility and confidence calibration through multimodal token alignment, a reliability-aware fusion gate, and a structurally constrained diffusion head. Evaluated on the nuScenes benchmark, MambaFusion achieves a new state of the art, demonstrating significantly improved detection robustness, temporal stability, and model interpretability.
📝 Abstract
Reliable 3D object detection is fundamental to autonomous driving, yet fusing camera and LiDAR data remains a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but only sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they struggle with inefficient context modeling, spatially inflexible fusion, and reasoning under uncertainty. We introduce MambaFusion, a unified multimodal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed Transformers to propagate global context in linear time while preserving local geometric fidelity. A multimodal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera-LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. MambaFusion establishes new state-of-the-art performance on the nuScenes benchmark while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.
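The reliability-aware fusion gate can be pictured as a per-cell softmax weighting over the two modalities' BEV features. The sketch below is a minimal NumPy illustration of that idea under assumed shapes; the function name `reliability_gate` and the confidence-map inputs are hypothetical and not the paper's actual implementation.

```python
import numpy as np

def reliability_gate(cam_feat, lidar_feat, cam_conf, lidar_conf):
    """Fuse per-cell BEV features, weighting each modality by its
    spatial reliability via a softmax over the two confidence maps.

    cam_feat, lidar_feat: (H, W, C) feature maps
    cam_conf, lidar_conf: (H, W) reliability logits
    """
    logits = np.stack([cam_conf, lidar_conf], axis=-1)           # (H, W, 2)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                        # softmax weights
    fused = w[..., 0:1] * cam_feat + w[..., 1:2] * lidar_feat    # (H, W, C)
    return fused, w

# Toy 2x2 BEV grid with 3-channel features.
cam = np.ones((2, 2, 3)) * 1.0
lid = np.ones((2, 2, 3)) * 3.0
# LiDAR judged highly reliable everywhere, camera not at all:
fused, w = reliability_gate(cam, lid,
                            cam_conf=np.zeros((2, 2)),
                            lidar_conf=np.full((2, 2), 4.0))
# The gate leans heavily toward LiDAR, so fused values approach 3.0.
```

In the full model the confidence maps would themselves be learned (e.g. from calibration consistency), but the gating arithmetic is the same convex per-cell combination shown here.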