🤖 AI Summary
To address the challenge of zero-label test modalities in human activity recognition, this paper formalizes Unsupervised Modality Adaptation (UMA), a cross-modal transfer setting in which no labeled instances of the test modality are available during training. It develops three transfer methods that exploit the structure of a unified multimodal representation space: Student–Teacher (ST), Contrastive Alignment (CA), and Cross-modal Transfer Through Time (C3T). C3T introduces a novel mechanism for aligning time-varying latent vectors extracted from the receptive field of temporal convolutions, preserving fine-grained temporal structure that single-vector representations discard. Across several camera+IMU datasets, C3T outperforms ST and CA by a margin of at least 8%, approaches the supervised upper bound, and remains robust to temporal noise, suggesting strong potential for generalizable models for time-series sensor data.
📝 Abstract
In order to unlock the potential of diverse sensors, we investigate a method to transfer knowledge between modalities using the structure of a unified multimodal representation space for Human Action Recognition (HAR). We formalize and explore an understudied cross-modal transfer setting we term Unsupervised Modality Adaptation (UMA), where the modality used in testing is not used in supervised training, i.e., zero labeled instances of the test modality are available during training. We develop three methods to perform UMA: Student-Teacher (ST), Contrastive Alignment (CA), and Cross-modal Transfer Through Time (C3T). Our extensive experiments on various camera+IMU datasets compare these methods to each other in the UMA setting, and to their empirical upper bound in the supervised setting. The results indicate C3T is the most robust and highest performing by a margin of at least 8%, and nears the supervised setting performance even in the presence of temporal noise. This method introduces a novel mechanism for aligning signals across time-varying latent vectors, extracted from the receptive field of temporal convolutions. Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for multimodal learning in various applications.
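The core idea of aligning time-varying latents can be illustrated with a minimal sketch: each position in a temporal convolution's output is a latent vector tied to one receptive-field window, and two modalities are compared per time step rather than after pooling into a single vector. This is an illustrative simplification, not the paper's implementation; all function names are hypothetical.

```python
import numpy as np

def temporal_conv(x, w):
    """Valid 1D convolution over time.
    x: (T, C_in) input sequence; w: (K, C_in, C_out) kernel.
    Returns (T-K+1, C_out): one latent vector per receptive-field position."""
    T, _ = x.shape
    K, _, _ = w.shape
    # Each output row summarizes the window x[t:t+K] (its receptive field).
    return np.stack(
        [np.einsum("kc,kcd->d", x[t:t + K], w) for t in range(T - K + 1)]
    )

def timewise_alignment(za, zb):
    """Mean cosine similarity between per-time-step latents of two modalities,
    preserving temporal structure instead of collapsing to one global vector."""
    za_n = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb_n = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    return float(np.mean(np.sum(za_n * zb_n, axis=1)))

# Toy example: two synthetic "modalities" of the same underlying signal.
rng = np.random.default_rng(0)
x_cam = rng.normal(size=(10, 3))   # e.g., pose features over 10 time steps
w_cam = rng.normal(size=(4, 3, 5)) # kernel size 4, 5-dim latent space
z_cam = temporal_conv(x_cam, w_cam)
print(z_cam.shape)                  # (7, 5): 7 time-aligned latent vectors
print(timewise_alignment(z_cam, z_cam))  # 1.0 for identical latents
```

A training objective in this spirit would push `timewise_alignment` toward 1 for temporally corresponding windows from different modalities, which is what lets labels learned on one modality transfer to another at test time.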