🤖 AI Summary
To address core challenges in remote photoplethysmography (rPPG), including high sensitivity to illumination variations, severe motion artifacts, and weak temporal modeling, this paper proposes the first large language model (LLM)-collaborative optimization framework tailored for physiological signal estimation. Methodologically, it introduces a novel Text Prototype Guidance (TPG) mechanism that enables cross-modal alignment between rPPG signals and semantic representations; designs a Dual-Domain Stationary (DDS) algorithm that adaptively re-weights time-frequency features for greater robustness; and systematically incorporates three types of domain-specific priors: physiological statistics, environmental context, and task descriptions. Evaluated on four benchmark datasets, the proposed method consistently outperforms existing state-of-the-art approaches, demonstrating superior generalization and measurement accuracy, particularly under challenging conditions involving complex illumination and dynamic subject motion.
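The TPG mechanism is only described at a high level here. The PyTorch sketch below illustrates one plausible reading of it, assuming learnable text prototypes in the LLM embedding space to which rPPG features are softly assigned via attention; the class name `TextPrototypeGuidance`, the prototype count, and the softmax temperature are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextPrototypeGuidance(nn.Module):
    """Sketch of the TPG idea: align hemodynamic (rPPG) features with a
    small set of learnable text prototypes living in the LLM's embedding
    space, yielding LLM-interpretable pseudo-tokens. Hypothetical design."""

    def __init__(self, feat_dim: int, llm_dim: int, num_prototypes: int = 16):
        super().__init__()
        # Learnable prototypes in the LLM token-embedding space
        # (prototype count is an assumption; the summary gives no number).
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, llm_dim))
        # Projection from hemodynamic features into that space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, rppg_feats: torch.Tensor) -> torch.Tensor:
        # rppg_feats: (batch, time, feat_dim) frame-level rPPG features.
        q = F.normalize(self.proj(rppg_feats), dim=-1)   # (B, T, D)
        p = F.normalize(self.prototypes, dim=-1)         # (K, D)
        # Soft assignment of each time step to the prototypes
        # (0.07 is a common contrastive temperature, assumed here).
        attn = torch.softmax(q @ p.t() / 0.07, dim=-1)   # (B, T, K)
        # Each step becomes a convex combination of text prototypes,
        # i.e. a sequence the LLM can consume as semantic tokens.
        return attn @ p                                  # (B, T, D)
```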
📝 Abstract
Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies and thus offer a potential solution, yet their text-centric design leaves them ill-suited to the continuous, noise-sensitive nature of rPPG signals. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, a Text Prototype Guidance (TPG) strategy establishes cross-modal alignment by projecting hemodynamic features into an LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. In addition, a novel Dual-Domain Stationary (DDS) algorithm resolves signal instability through adaptive time-frequency feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental context answering, and task descriptions; by leveraging cross-modal learning to integrate visual and textual information, the framework adapts dynamically to challenging scenarios such as variable illumination and subject movement. Evaluated on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.
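The abstract characterizes DDS only as adaptive time-frequency feature re-weighting. The following minimal sketch shows one way such re-weighting could look, gating temporal features with both a per-step time-domain gate and a spectral gate computed from an FFT; the module name `DualDomainReweighting` and the sigmoid-gating form are assumptions for illustration, not the published algorithm.

```python
import torch
import torch.nn as nn

class DualDomainReweighting(nn.Module):
    """Sketch of the DDS idea: down-weight unstable (non-stationary)
    feature segments using gates derived jointly from time-domain and
    frequency-domain statistics. Gating form is hypothetical."""

    def __init__(self, dim: int):
        super().__init__()
        self.time_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.freq_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) temporal feature sequence.
        # Frequency-domain magnitudes via a real FFT along the time axis.
        spec = torch.fft.rfft(x, dim=1).abs()                 # (B, T//2+1, dim)
        w_t = self.time_gate(x)                               # per-step gate
        w_f = self.freq_gate(spec.mean(dim=1, keepdim=True))  # per-channel gate
        # Features judged unstable in either domain are attenuated.
        return x * w_t * w_f
```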