Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the representation gap and performance degradation caused by missing or low-quality modalities in multimodal human sensing. To tackle these challenges, the authors propose a “purify-then-align” framework that first suppresses the influence of noisy modalities through a meta-learning-driven dynamic modality weighting mechanism, followed by diffused knowledge distillation to transfer the purified multimodal teacher knowledge to a unimodal student model. This approach uniquely integrates meta-learning with diffused knowledge distillation, effectively decoupling the causal dependency between modality corruption and representation disparity. Experimental results on the MM-Fi and XRF55 datasets demonstrate that the method significantly enhances the robustness and performance of unimodal models across diverse modality-missing scenarios.
📝 Abstract
Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel "Purify-then-Align" framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this "Purify-then-Align" strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.
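The abstract's "purify-then-align" pipeline has two steps: down-weight low-quality modalities to form a purified teacher representation, then distill that teacher into each single-modality student. The minimal sketch below illustrates only the general shape of that idea; the softmax weighting, the fixed quality logits, and the MSE alignment loss are illustrative stand-ins, not the paper's meta-learned weighting or its diffusion-based distillation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def purified_teacher(features, quality_logits):
    """Form a 'clean teacher' as a weighted consensus of modality features.
    The paper learns these weights with meta-learning; here the logits
    are fixed for illustration."""
    w = softmax(quality_logits)  # one weight per modality, sums to 1
    teacher = sum(wi * f for wi, f in zip(w, features))
    return teacher, w

def distill_loss(teacher, student):
    """Feature-alignment loss (plain MSE here, standing in for the
    paper's diffusion-based knowledge distillation)."""
    return float(np.mean((teacher - student) ** 2))

rng = np.random.default_rng(0)
feats = [rng.normal(size=8) for _ in range(3)]  # e.g. WiFi, IMU, depth features
logits = np.array([2.0, 2.0, -4.0])             # third modality deemed noisy
teacher, w = purified_teacher(feats, logits)    # noisy modality is suppressed
student = rng.normal(size=8)                    # a unimodal student's features
loss = distill_loss(teacher, student)           # minimized during training
```

In the full method, the quality logits would be produced by the meta-learning-driven weighting mechanism rather than set by hand, and each student modality would be trained to match the purified teacher so that the resulting single-modality encoder retains cross-modal knowledge when other modalities are missing.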
Problem

Research questions and friction points this paper is trying to address.

modality missing
representation gap
contamination effect
robust human sensing
multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Purify-then-Align
knowledge distillation
meta-learning
multimodal robustness
modality missing
Pengcheng Weng
School of Software Engineering, Xi'an Jiaotong University, China; Institute of Computer Science, Universität Bern, Switzerland
Yanyu Qian
College of Computing and Data Science, Nanyang Technological University, Singapore; School of Software Engineering, Xi’an Jiaotong University, China
Yangxin Xu
The Chinese University of Hong Kong
Fei Wang
Xi'an Jiaotong University
computer vision · artificial intelligence