🤖 AI Summary
Existing CLIP-style contrastive learning methods for multimodal time-series modeling overfit to easily alignable features and erroneously treat valid cross-modal pairs as negatives, leading to fragmented representations and poor generalization. To address this, we propose ProtoMM, the first foundation model for pulse-motion-oriented multimodal physiological signals. ProtoMM introduces a shared prototype dictionary that unifies the representation spaces of heterogeneous modalities, including ECG, PPG, EDA, and accelerometer data, eliminating negative sampling and enabling cross-modal collaborative clustering with improved interpretability. Through self-supervised multimodal temporal encoding, prototype-driven clustering, and shared-embedding optimization, ProtoMM establishes a universal "common language" among physiological signals. Evaluated on multiple downstream tasks, ProtoMM achieves state-of-the-art performance, significantly outperforming mainstream contrastive learning and multimodal self-supervised learning approaches.
📝 Abstract
Modeling multi-modal time-series data is critical for capturing system-level dynamics, particularly in biosignals where modalities such as ECG, PPG, EDA, and accelerometry provide complementary perspectives on interconnected physiological processes. While recent self-supervised learning (SSL) advances have improved unimodal representation learning, existing multi-modal approaches often rely on CLIP-style contrastive objectives that overfit to easily aligned features and misclassify valid cross-modal relationships as negatives, resulting in fragmented and non-generalizable embeddings. To overcome these limitations, we propose ProtoMM, a novel SSL framework that introduces a shared prototype dictionary to anchor heterogeneous modalities in a common embedding space. By clustering representations around shared prototypes rather than explicit negative sampling, our method captures complementary information across modalities and provides a coherent "common language" for physiological signals. In this work, we focus on developing a Pulse Motion foundation model with ProtoMM and demonstrate that our approach outperforms contrastive-only and prior multimodal SSL methods, achieving state-of-the-art performance while offering improved interpretability of learned features.
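The core idea, clustering each modality's embeddings around a single shared prototype dictionary instead of contrasting against negatives, can be illustrated with a minimal SwAV-style sketch. The dimensions, temperature, and swapped-prediction form below are illustrative assumptions, not ProtoMM's exact formulation, and the random arrays stand in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical sizes: batch of 8 windows, 16-dim embeddings, 4 shared prototypes.
n, d, k = 8, 16, 4

# Stand-ins for two modality encoders' outputs (e.g., ECG and PPG windows).
z_ecg = l2_normalize(rng.normal(size=(n, d)))
z_ppg = l2_normalize(rng.normal(size=(n, d)))

# One shared prototype dictionary anchors every modality in the same space.
prototypes = l2_normalize(rng.normal(size=(k, d)))

def soft_assign(z, protos, temperature=0.1):
    """Soft cluster assignment from cosine similarity to each prototype."""
    logits = z @ protos.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

q_ecg = soft_assign(z_ecg, prototypes)
q_ppg = soft_assign(z_ppg, prototypes)

def cross_entropy(target, pred, eps=1e-9):
    return -np.mean(np.sum(target * np.log(pred + eps), axis=1))

# Swapped-prediction objective: each modality's assignment should predict the
# other's, with no negative pairs anywhere in the loss.
loss = 0.5 * (cross_entropy(q_ecg, q_ppg) + cross_entropy(q_ppg, q_ecg))
print(loss)
```

Because both modalities are scored against the same prototypes, a low loss forces synchronized windows of different signals into the same soft clusters, which is the "common language" behavior described above.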