SiMO: Single-Modality-Operable Multimodal Collaborative Perception

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation in multimodal collaborative perception caused by semantic inconsistency when a critical sensor (e.g., LiDAR) fails during feature fusion. To this end, the authors propose SiMO, a framework that, for the first time, enables multimodal fusion while remaining operable on any single modality independently. The core of SiMO consists of a Length-Adaptive Multi-Modal Fusion mechanism (LAMMA) and a four-stage training strategy (pretraining, alignment, fusion, and robustness distillation, RD) that explicitly aligns and decouples features to ensure both modality independence and semantic consistency. Experimental results demonstrate that SiMO achieves state-of-the-art performance across all single- and multi-modal settings, significantly improving robustness under partial sensor failure.

📝 Abstract
Collaborative perception integrates multi-agent perspectives to extend the sensing range and overcome occlusion. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor such as LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle the remaining modal features during modality failures while maintaining consistency of the semantic space. Additionally, leveraging the novel "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses the issue of modality competition, which is generally overlooked by existing methods, ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities. The implementation is available at https://github.com/dempsey-wen/SiMO.
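The abstract's core idea, handling whichever modality features remain after a sensor failure while keeping the fused output in one shared semantic space, can be illustrated with a minimal sketch. This is not the paper's LAMMA implementation; the projection weights, dimensions, and mean-based fusion here are illustrative assumptions, showing only that a fusion operator adaptive to the number of surviving modalities yields a fixed-shape feature whether one or all sensors are present.

```python
import numpy as np

rng = np.random.default_rng(0)
SHARED_DIM = 8  # dimensionality of the hypothetical shared semantic space

# Hypothetical per-modality projections (stand-ins for learned encoder heads)
# mapping each modality's feature vector into the shared space.
projections = {
    "lidar":  rng.standard_normal((16, SHARED_DIM)),
    "camera": rng.standard_normal((12, SHARED_DIM)),
}

def fuse(available_feats: dict) -> np.ndarray:
    """Fuse whichever modality features are available into one shared-space vector.

    Length-adaptive in the simplest sense: the mean is taken over however
    many modalities survive, so a single-modality input is still valid.
    """
    if not available_feats:
        raise ValueError("at least one modality must be available")
    projected = [feats @ projections[name] for name, feats in available_feats.items()]
    return np.mean(projected, axis=0)

# Full multimodal input and a camera-only fallback (simulating LiDAR failure)
# both produce a (SHARED_DIM,) feature consumable by the same downstream head.
full = fuse({"lidar": rng.standard_normal(16), "camera": rng.standard_normal(12)})
camera_only = fuse({"camera": rng.standard_normal(12)})
```

The point of the sketch is the interface: downstream modules see a feature of the same shape and semantics regardless of which sensors failed, which is the property the abstract attributes to LAMMA.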
Problem

Research questions and friction points this paper is trying to address.

collaborative perception
multimodal fusion
modality failure
semantic mismatch
modality competition
Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative perception
single-modality-operable
multi-modal fusion
modality failure
semantic consistency
Jiageng Wen — Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University
Shengjie Zhao — School of Computer Science and Technology, Tongji University
Bing Li — School of Computer Science and Technology, Tongji University
Jiafeng Huang — School of Mechatronic Engineering and Automation, Shanghai University
Kenan Ye — School of Computer Science and Technology, Tongji University
Hao Deng — Engineer