🤖 AI Summary
This study systematically evaluates the performance gap between general-purpose and domain-specific foundation models for photoplethysmography (PPG)-based health assessment across 51 clinically relevant tasks, including cardiac state identification, laboratory biomarker prediction, and cross-modal inference. Methodologically, it introduces the first comprehensive benchmark spanning seven dimensions: win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. Employing full fine-tuning alongside multidimensional evaluation (win-score comparison, attention visualization, and feature separability analysis), it demonstrates that the domain-specific model achieves a 27% higher win score overall, exhibiting superior physiological signal modeling, stability, and clinical adaptability. The core contribution lies in quantifying the gains conferred by domain-specialized design for PPG-based health inference and elucidating the interplay between training-data choice and model design in governing generalization performance.
📝 Abstract
Foundation models are large-scale machine learning models that are pre-trained on massive amounts of data and can be adapted to a wide range of downstream tasks. They have been applied extensively in Natural Language Processing and Computer Vision through models such as GPT, BERT, and CLIP, and are now gaining increasing attention in time-series analysis, particularly for physiological sensing. However, most time-series foundation models are specialist models: the data used for pre-training and testing are of the same type, such as electrocardiogram (ECG), electroencephalogram (EEG), or photoplethysmogram (PPG). Recent works such as MOMENT instead train a generalist time-series foundation model on data from multiple domains, such as weather, traffic, and electricity. This paper conducts a comprehensive benchmarking study comparing the performance of generalist and specialist models, with a focus on PPG signals. Through an extensive suite of 51 tasks covering cardiac state assessment, laboratory value estimation, and cross-modal inference, we evaluate both models across seven dimensions: win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. These metrics jointly capture not only each model's capability but also its adaptability, robustness, and efficiency under different fine-tuning strategies, providing a holistic picture of their strengths and limitations for diverse downstream scenarios. In a full fine-tuning setting, we show that the specialist model achieves a 27% higher win score. Finally, we provide further analysis of generalization, fairness, attention visualizations, and the importance of training-data choice.
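To make the win-score comparison concrete, here is a minimal sketch of how such a metric could be computed across a task suite. This is an illustrative assumption, not the paper's exact scoring rule: the task names, per-task metric values, and the simple "fraction of tasks won" definition below are all invented for demonstration.

```python
def win_score(scores_a: dict, scores_b: dict) -> float:
    """Fraction of tasks on which model A strictly beats model B.

    Assumes both dicts share the same task keys and that higher
    metric values (e.g., AUROC) are better.
    """
    assert scores_a.keys() == scores_b.keys()
    wins = sum(1 for task in scores_a if scores_a[task] > scores_b[task])
    return wins / len(scores_a)

# Hypothetical per-task metrics for a specialist and a generalist model.
specialist = {"task1": 0.91, "task2": 0.84, "task3": 0.78, "task4": 0.88}
generalist = {"task1": 0.87, "task2": 0.86, "task3": 0.74, "task4": 0.81}

print(win_score(specialist, generalist))  # 0.75: specialist wins 3 of 4 tasks
```

In practice a benchmark would aggregate many more tasks and may break ties or weight tasks differently; the point here is only that a win score summarizes head-to-head outcomes rather than averaging raw metric values.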