🤖 AI Summary
Existing automated sleep staging methods rely heavily on handcrafted polysomnography (PSG) features and domain-specific models, and they suffer from poor interpretability and high data requirements. To address these limitations, we propose a multimodal foundation modeling paradigm. Raw PSG time-series signals are losslessly transformed into 2D waveform images to preserve temporal structure and emulate clinical visual inspection. For the first time, a general-purpose multimodal large language model is adapted to sleep staging via end-to-end fine-tuning, enabling cross-modal feature fusion and attention-driven interpretability. Evaluated on three large-scale public benchmarks (ISRUC, MASS, and SHHS), our method achieves significant improvements over state-of-the-art approaches in accuracy, robustness, and generalizability. Results demonstrate strong clinical applicability and highlight the paradigm's potential for broader biomedical signal analysis.
📝 Abstract
Sleep staging is essential for diagnosing sleep disorders and assessing neurological health. Existing automatic methods typically extract features from complex polysomnography (PSG) signals and train domain-specific models, which are often unintuitive and require large, specialized datasets. To overcome these limitations, we introduce a new paradigm for sleep staging that leverages large general-purpose multimodal models to emulate clinical diagnostic practice. Specifically, we convert raw one-dimensional PSG time series into intuitive two-dimensional waveform images and then fine-tune a multimodal large model to learn from these representations. Experiments on three public datasets (ISRUC, MASS, SHHS) demonstrate that our approach enables general-purpose models, without prior exposure to sleep data, to acquire robust staging capabilities. Moreover, explanation analysis reveals that our model learned to mimic the visual diagnostic workflow that human experts follow when staging sleep from PSG images. The proposed method consistently outperforms state-of-the-art baselines in accuracy and robustness, highlighting its efficiency and practical value for medical applications. The code for the signal-to-image pipeline and the PSG image dataset will be released.
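The signal-to-image step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline, which had not been released when the abstract was written. It is an assumed, standard-library-only rasterizer that draws each channel of a single epoch as a binary waveform strip and stacks the strips vertically. The function names (`rasterize_waveform`, `stack_channels`), the image size, and the 100 Hz sampling rate are all hypothetical choices made for illustration. The sketch picks one sample per pixel column, so unlike a full-resolution rendering it discards information.

```python
import math


def rasterize_waveform(samples, height=64, width=256, v_min=None, v_max=None):
    """Draw a 1D signal as a binary image (list of rows, 1 = trace pixel).

    Illustrative only: one sample is picked per pixel column, which is lossy.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    lo = min(samples) if v_min is None else v_min
    hi = max(samples) if v_max is None else v_max
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal

    image = [[0] * width for _ in range(height)]

    def to_row(value):
        # Clip to the amplitude window, then map high values to the top row.
        clipped = min(max(value, lo), hi)
        return round((hi - clipped) / span * (height - 1))

    n = len(samples)
    prev_row = None
    for col in range(width):
        # Nearest-sample decimation: choose the sample closest to this column.
        idx = round(col * (n - 1) / (width - 1)) if width > 1 else 0
        row = to_row(samples[idx])
        # Fill the vertical gap to the previous column so the trace is continuous.
        start = row if prev_row is None else prev_row
        for r in range(min(start, row), max(start, row) + 1):
            image[r][col] = 1
        prev_row = row
    return image


def stack_channels(channels, height_per_channel=64, width=256):
    """Stack one waveform strip per channel, top to bottom, as on a PSG display."""
    image = []
    for samples in channels:
        image.extend(rasterize_waveform(samples, height_per_channel, width))
    return image


# Synthetic 30-second epoch at an assumed 100 Hz: a 10 Hz sine stands in for EEG,
# a slow 0.5 Hz sine stands in for a second channel.
eeg = [math.sin(2 * math.pi * 10 * t / 100) for t in range(3000)]
slow = [math.sin(2 * math.pi * 0.5 * t / 100) for t in range(3000)]
epoch_image = stack_channels([eeg, slow])
print(len(epoch_image), len(epoch_image[0]))  # prints: 128 256
```

Images produced this way, paired with stage labels, could then serve as training inputs for fine-tuning a multimodal model. In practice a fixed amplitude window (`v_min`, `v_max`) per channel is likely preferable to per-epoch min/max scaling, because absolute amplitude carries diagnostic information that per-epoch scaling would erase.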