🤖 AI Summary
This study addresses the lack of large-scale, high-quality clinical benchmarks for photoplethysmography (PPG)-based algorithms, which has hindered fair model comparison. We present the first unified PPG benchmark for multitask clinical prediction. It covers arrhythmia classification, including the first systematic evaluation beyond atrial fibrillation and flutter, as well as regression of respiratory rate (RR), heart rate (HR), and blood pressure (BP). Using the MIMIC-III-Ext-PPG dataset, we train established deep learning architectures and conduct cross-dataset validation. Atrial fibrillation detection achieves an AUROC of 0.96 (0.97 cross-dataset), and physiological parameter estimation yields the following mean absolute errors (MAE): RR 2.97 breaths/min; HR 1.13 beats/min; systolic/diastolic BP (SBP/DBP) 16.13/8.70 mmHg. Subgroup analysis suggests that performance disparities stem from population-specific waveform characteristics rather than systematic model bias.
📝 Abstract
Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation, measured as mean absolute error (RR MAE: 2.97 breaths/min; HR MAE: 1.13 beats/min; systolic/diastolic BP (SBP/DBP) MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.
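For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MAE and AUROC are conventionally computed. It is not the authors' evaluation code: the function names and the toy values are hypothetical and chosen only for illustration, and the AUROC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values (hypothetical, not taken from the benchmark).
hr_reference = [60.0, 72.0, 88.0, 95.0]   # reference heart rate, beats/min
hr_predicted = [61.0, 70.0, 90.0, 95.0]   # model output, beats/min
af_labels = [0, 0, 1, 1, 0, 1]            # 1 = AF segment, 0 = non-AF segment
af_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

hr_mae = mean_absolute_error(hr_reference, hr_predicted)
af_auroc = auroc(af_labels, af_scores)
print(f"HR MAE: {hr_mae:.2f} beats/min, AF AUROC: {af_auroc:.3f}")
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch; at benchmark scale a rank-based implementation from an established library would be the more practical choice.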