🤖 AI Summary
This work tackles the unified modeling of multimodal nocturnal physiological signals (e.g., EEG and ECG), which is hampered by device heterogeneity and sensor dropout. The authors propose a cross-modal alignment pretraining framework that incorporates demographic, age, recording-site, and medical-history metadata to learn robust shared representations. Central to the approach are a metadata-aware InfoNCE loss and a dynamic negative-sample weighting mechanism. The study also characterizes, for the first time, scaling laws relating performance to modality diversity and model capacity. Evaluated on sleep staging and clinical outcome prediction, the method significantly outperforms strong baselines and remains robust across arbitrary modality subsets and missing-data scenarios.
📝 Abstract
Tasks ranging from sleep staging to clinical diagnosis traditionally rely on standard polysomnography (PSG), bedside monitors, and wearable devices, which capture diverse nocturnal biosignals (e.g., EEG, EOG, ECG, SpO$_2$). However, heterogeneity across devices and frequent sensor dropout pose significant challenges for unified modelling of these multimodal signals. We present \texttt{sleep2vec}, a foundation model for diverse and incomplete nocturnal biosignals that learns a shared representation via cross-modal alignment. \texttt{sleep2vec} is contrastively pre-trained on 42,249 overnight recordings spanning nine modalities using a \textit{Demography, Age, Site \& History-aware InfoNCE} objective that incorporates physiological and acquisition metadata (\textit{e.g.}, age, gender, recording site) to dynamically weight negatives and mitigate cohort-specific shortcuts. On downstream sleep staging and clinical outcome assessment, \texttt{sleep2vec} consistently outperforms strong baselines and remains robust to any subset of available modalities and to sensor dropout. We further characterize, to our knowledge for the first time, scaling laws for nocturnal biosignals with respect to modality diversity and model capacity. Together, these results show that unified cross-modal alignment, coupled with principled scaling, enables label-efficient, general-purpose modelling of real-world nocturnal biosignals.
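The metadata-aware InfoNCE objective described above can be sketched in a minimal form. This is an illustrative reconstruction, not the authors' implementation: the function name, the `meta_sim` matrix (pairwise metadata similarity in [0, 1], built from age, gender, recording site, and history), and the specific weighting scheme (negatives sharing metadata are down-weighted to suppress cohort-specific shortcuts) are all assumptions for the sake of exposition.

```python
import numpy as np

def metadata_weighted_infonce(z_a, z_b, meta_sim, temperature=0.07):
    """Hypothetical sketch of a metadata-aware InfoNCE loss.

    z_a, z_b   : (N, D) paired embeddings from two modalities;
                 row i of z_a and row i of z_b form the positive pair.
    meta_sim   : (N, N) assumed pairwise metadata similarity in [0, 1]
                 (e.g., derived from age, gender, recording site).
    Negatives with similar metadata get lower weight, so the model
    cannot score them easily apart via cohort cues alone.
    """
    # L2-normalize embeddings so logits are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=-1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=-1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N)

    # Dynamic negative weights: metadata-similar pairs count less;
    # positives (the diagonal) always keep full weight.
    w = 1.0 - meta_sim
    np.fill_diagonal(w, 1.0)

    # Numerically stable weighted log-sum-exp over each row
    m = logits.max(axis=1, keepdims=True)
    denom = np.log((w * np.exp(logits - m)).sum(axis=1)) + m[:, 0]

    # Standard InfoNCE form: -log( exp(pos) / sum_j w_j * exp(logit_j) )
    log_prob = np.diagonal(logits) - denom
    return -log_prob.mean()
```

With `meta_sim` set to all zeros this reduces to vanilla InfoNCE; as off-diagonal entries approach 1, those negatives vanish from the denominator, which is one simple way to realize the "dynamic negative weighting" the abstract describes.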