🤖 AI Summary
This work addresses the representation gap and performance degradation caused by missing or low-quality modalities in multimodal human sensing. To tackle these challenges, the authors propose a "purify-then-align" framework that first suppresses the influence of noisy modalities through a meta-learning-driven dynamic modality weighting mechanism, then applies diffusion-based knowledge distillation to transfer the purified multimodal teacher's knowledge to a unimodal student model. This approach uniquely integrates meta-learning with diffusion-based knowledge distillation, effectively decoupling the causal dependency between modality corruption and representation disparity. Experimental results on the MM-Fi and XRF55 datasets demonstrate that the method significantly enhances the robustness and performance of unimodal models across diverse modality-missing scenarios.
📝 Abstract
Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked: the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel "Purify-then-Align" framework that resolves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight noisy, low-contributing modalities. Subsequently, to align the modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The payoff of this "Purify-then-Align" strategy is a set of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect conditions, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models across diverse missing-modality scenarios.
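To make the "Purify-then-Align" idea concrete, the following is a minimal, hypothetical sketch of its two stages: a softmax-weighted consensus that down-weights low-quality modalities (standing in for the meta-learned weighting), and a simple feature-alignment loss between each student modality and the resulting clean teacher (standing in for the paper's diffusion-based distillation). All function names, the choice of softmax weighting, and the MSE loss are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def purified_teacher(features, importance_logits):
    """Stage 1 (Purify): form a clean teacher as a weighted consensus of
    per-modality features. `importance_logits` would be meta-learned in the
    paper; here they are plain inputs (an assumption). Noisy modalities get
    low logits and are effectively suppressed."""
    w = softmax(importance_logits)
    teacher = sum(w[i] * f for i, f in enumerate(features))
    return teacher, w

def distill_loss(student_feat, teacher_feat):
    """Stage 2 (Align): pull a single-modality student's features toward the
    clean teacher. Plain MSE is used here as a stand-in for the paper's
    diffusion-based refinement."""
    return float(np.mean((student_feat - teacher_feat) ** 2))
```

For example, with three modality features and a very low logit for a corrupted second modality, the teacher is dominated by the two clean modalities, and each student encoder is then trained to minimize `distill_loss` against that teacher.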