🤖 AI Summary
Existing CLIP-style contrastive learning methods for multimodal time-series modeling overfit to easily alignable features and erroneously treat valid cross-modal pairs as negatives, leading to fragmented representations and poor generalization. To address this, we propose ProtoMM, the first foundation model for pulse-motion-oriented multimodal physiological signals. ProtoMM introduces a shared prototype dictionary that unifies the representation spaces of heterogeneous modalities, including ECG, PPG, EDA, and accelerometer data, eliminating negative sampling and enabling cross-modal collaborative clustering with improved interpretability. Through self-supervised multimodal temporal encoding, prototype-driven clustering, and shared-embedding optimization, ProtoMM establishes a universal "common language" among physiological signals. Evaluated on multiple downstream tasks, ProtoMM achieves state-of-the-art performance, significantly outperforming mainstream contrastive learning and multimodal self-supervised learning approaches.
📝 Abstract
Modeling multi-modal time-series data is critical for capturing system-level dynamics, particularly in biosignals where modalities such as ECG, PPG, EDA, and accelerometry provide complementary perspectives on interconnected physiological processes. While recent self-supervised learning (SSL) advances have improved unimodal representation learning, existing multi-modal approaches often rely on CLIP-style contrastive objectives that overfit to easily aligned features and misclassify valid cross-modal relationships as negatives, resulting in fragmented and non-generalizable embeddings. To overcome these limitations, we propose ProtoMM, a novel SSL framework that introduces a shared prototype dictionary to anchor heterogeneous modalities in a common embedding space. By clustering representations around shared prototypes rather than explicit negative sampling, our method captures complementary information across modalities and provides a coherent "common language" for physiological signals. In this work, we focus on developing a Pulse Motion foundation model with ProtoMM and demonstrate that our approach outperforms contrastive-only and prior multimodal SSL methods, achieving state-of-the-art performance while offering improved interpretability of learned features.
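The core idea, clustering each modality's embeddings around a single shared prototype dictionary instead of contrasting against negatives, can be illustrated with a minimal SwAV-style sketch. The dimensions, temperature, and swapped-prediction form below are illustrative assumptions, not ProtoMM's exact formulation, and the random arrays stand in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical sizes: batch of 8 windows, 16-dim embeddings, 4 shared prototypes.
n, d, k = 8, 16, 4

# Stand-ins for two modality encoders' outputs (e.g., ECG and PPG windows).
z_ecg = l2_normalize(rng.normal(size=(n, d)))
z_ppg = l2_normalize(rng.normal(size=(n, d)))

# One shared prototype dictionary anchors every modality in the same space.
prototypes = l2_normalize(rng.normal(size=(k, d)))

def soft_assign(z, protos, temperature=0.1):
    """Soft cluster assignment from cosine similarity to each prototype."""
    logits = z @ protos.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

q_ecg = soft_assign(z_ecg, prototypes)
q_ppg = soft_assign(z_ppg, prototypes)

def cross_entropy(target, pred, eps=1e-9):
    return -np.mean(np.sum(target * np.log(pred + eps), axis=1))

# Swapped-prediction objective: each modality's assignment should predict the
# other's, with no negative pairs anywhere in the loss.
loss = 0.5 * (cross_entropy(q_ecg, q_ppg) + cross_entropy(q_ppg, q_ecg))
print(loss)
```

Because both modalities are scored against the same prototypes, a low loss forces synchronized windows of different signals into the same soft clusters, which is the "common language" behavior described above.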