🤖 AI Summary
To address key challenges in IoT multimodal fusion, namely high model complexity, shallow inter-modal relationship modeling that relies solely on unidirectional alignment, and poor robustness when sensor data goes missing, this paper proposes GRAM-MAMBA, a lightweight, efficient, and robust framework. Methodologically, it employs the linear-complexity Mamba model for temporal sensor data, uses an optimized GRAM-matrix strategy to align every pair of modalities rather than aligning all modalities to a single anchor, and incorporates LoRA-inspired adaptive low-rank compensation layers that can be incrementally fine-tuned after training to adapt to missing modalities. Evaluated on the SPAWC2021 indoor positioning and USC-HAD activity recognition datasets, GRAM-MAMBA reaches 93.55% F1 on USC-HAD, outperforming prior work, and its adaptation strategy yields a 24.5% localization improvement and a 23% F1 gain under missing modalities while fine-tuning only 0.2-0.3% of the parameters, significantly outperforming baselines.
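As a rough illustration of the GRAM-matrix alignment idea (the paper's exact objective may differ), the sketch below builds the Gram matrix of L2-normalized per-modality embeddings for each sample and penalizes the volume it spans, which shrinks toward zero as the modality embeddings become mutually aligned. The function name `gram_volume_loss` and the volume-based penalty are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F


def gram_volume_loss(embeddings: list[torch.Tensor], eps: float = 1e-6) -> torch.Tensor:
    """Pairwise alignment via the Gram matrix of per-modality embeddings.

    embeddings: list of k tensors, each of shape (batch, d), one per modality.
    Returns the mean k-volume spanned by the k normalized modality vectors;
    a small volume means the modalities are mutually aligned.
    """
    # Stack into (batch, k, d), each modality vector L2-normalized.
    A = torch.stack([F.normalize(e, dim=-1) for e in embeddings], dim=1)
    # Gram matrix per sample: pairwise cosine similarities, shape (batch, k, k).
    G = A @ A.transpose(1, 2)
    # Squared k-volume = det(G); the jitter keeps the determinant numerically stable.
    k = A.shape[1]
    vol_sq = torch.det(G + eps * torch.eye(k, device=A.device, dtype=A.dtype))
    return vol_sq.clamp(min=0).sqrt().mean()


# Hypothetical usage: three modality encoders producing 128-dim embeddings.
imu, wifi, rssi = (torch.randn(32, 128) for _ in range(3))
loss = gram_volume_loss([imu, wifi, rssi])
```

In practice such a term would be added to the task loss so that, for example, IMU and WiFi embeddings of the same sample land in a shared region of the latent space.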
📝 Abstract
Multi-modal fusion is crucial for Internet of Things (IoT) perception and is widely deployed in smart homes, intelligent transport, industrial automation, and healthcare. However, existing systems often face challenges: high model complexity hinders deployment in resource-constrained environments, unidirectional modal alignment neglects inter-modal relationships, and robustness suffers when sensor data is missing. These issues impede efficient and robust multimodal perception in real-world IoT settings. To overcome these limitations, we propose GRAM-MAMBA. This framework utilizes the linear-complexity Mamba model for efficient sensor time-series processing, combined with an optimized GRAM-matrix strategy for pairwise alignment among modalities, addressing the shortcomings of traditional unidirectional alignment. Inspired by Low-Rank Adaptation (LoRA), we introduce an adaptive low-rank layer compensation strategy to handle missing modalities after training. This strategy freezes the pre-trained model core and the adaptive layers of unavailable modalities, fine-tuning only those related to the available modalities and the fusion process. Extensive experiments validate GRAM-MAMBA's effectiveness. On the SPAWC2021 indoor positioning dataset, the pre-trained model shows lower error than baselines, and adapting to missing modalities yields a 24.5% performance boost while training less than 0.2% of parameters. On the USC-HAD human activity recognition dataset, it achieves 93.55% F1 and 93.81% Overall Accuracy (OA), outperforming prior work, and the update strategy increases F1 by 23% while training less than 0.3% of parameters. These results highlight GRAM-MAMBA's potential for efficient and robust multimodal perception in resource-constrained environments.
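To make the freeze-and-fine-tune adaptation concrete, here is a minimal PyTorch sketch of the pattern the abstract describes: the pre-trained weights stay frozen, and only low-rank adapters attached to the still-available modality branches and the fusion stage receive gradients. The `LoRALinear` wrapper, the branch naming convention ("imu", "wifi", "fusion"), and the rank/alpha defaults are assumptions for illustration, not the paper's actual module layout.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update (Wx + B A x)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pre-trained core frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


def select_trainable(model: nn.Module, available: set[str]) -> None:
    """Enable gradients only for adapters on available modality branches or the fusion stage."""
    for name, param in model.named_parameters():
        branch = name.split(".")[0]  # assumes top-level submodules are named per modality
        is_adapter = "lora_" in name
        param.requires_grad_(is_adapter and (branch in available or branch == "fusion"))


# Hypothetical usage: the WiFi sensor dropped out, so only IMU and fusion adapters train.
model = nn.ModuleDict({
    "imu": LoRALinear(nn.Linear(64, 128)),
    "wifi": LoRALinear(nn.Linear(32, 128)),
    "fusion": LoRALinear(nn.Linear(128, 10)),
})
select_trainable(model, available={"imu"})
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Only the `lora_A`/`lora_B` tensors of the surviving branches and the fusion head end up trainable, which is the kind of selective update that keeps the adapted parameter count in the sub-percent range the abstract reports.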