Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement

📅 2024-07-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unsupervised remote photoplethysmography (rPPG) for physiological measurement, i.e., estimating heart and respiratory rates without synchronized ground-truth PPG labels. The authors propose the first self-supervised rPPG framework built on vision-language models (VLMs). Methodologically, they introduce frequency-aware video augmentation, spatio-temporal map and text pair construction, and multi-task collaborative optimization to align VLMs with physiological frequency knowledge across modalities. Key components include frequency-guided vision-text contrastive learning, text-guided visual map reconstruction, and a frequency contrastive and ranking task. Evaluated on four benchmark datasets, the approach significantly outperforms existing self-supervised rPPG methods in both heart rate and respiratory rate estimation, demonstrating superior accuracy, robustness, and generalization. The results validate the effectiveness of incorporating physiological frequency priors into VLM-based modeling.

📝 Abstract
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To address this, self-supervised learning has recently gained attention; however, its performance is limited due to the lack of ground-truth PPG signals. In this paper, we propose a novel self-supervised framework that successfully integrates popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing text prompts that describe their relative signal-frequency ratios. A pre-trained VLM is employed to extract features from these vision-text pairs and subsequently estimate rPPG signals. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including a text-guided visual map reconstruction task, a vision-text contrastive learning task, and a frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align frequency-related knowledge across the vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
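The abstract's first step, augmenting positive and negative samples with varying rPPG frequencies, can be sketched as follows. The idea is that temporally resampling a face video by a factor scales its underlying physiological frequency by the same factor, yielding label-free pairs plus a text prompt describing the relative ratio. This is a minimal illustrative sketch, not the paper's implementation; all function names and the prompt wording are assumptions.

```python
import numpy as np

def resample_video(frames: np.ndarray, speed: float) -> np.ndarray:
    """Temporally resample a video (frames along axis 0) by `speed` using
    nearest-frame indexing. Playing the video at `speed`x scales its rPPG
    frequency by the same factor (illustrative approximation)."""
    t = frames.shape[0]
    idx = np.clip(np.round(np.arange(t) * speed), 0, t - 1).astype(int)
    return frames[idx]

def make_frequency_pairs(frames: np.ndarray,
                         neg_speeds=(0.6, 0.8, 1.25, 1.5)):
    """Build one positive (same frequency) and several negatives whose
    frequencies differ by known ratios -- the ratios are the supervision."""
    positive = resample_video(frames, 1.0)
    negatives = [resample_video(frames, s) for s in neg_speeds]
    ratios = [1.0] + list(neg_speeds)
    return positive, negatives, ratios

def frequency_prompt(ratio: float) -> str:
    """Hypothetical text prompt describing a relative frequency ratio,
    in the spirit of the paper's frequency-oriented vision-text pairs."""
    return (f"the physiological signal frequency of this video is "
            f"{ratio:.2f} times that of the anchor video")
```

The known ratios make contrastive and ranking objectives possible without ground-truth PPG: samples with ratio 1.0 are pulled together, others pushed apart and ordered by ratio.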
Problem

Research questions and friction points this paper is trying to address.

Estimating vital signs from facial video without synchronized ground-truth PPG labels
Limited performance of existing self-supervised rPPG methods
Aligning frequency-related physiological knowledge across vision and text modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

First integration of pre-trained VLMs into self-supervised remote physiological measurement
Frequency-oriented vision-text pair generation via contrastive spatio-temporal maps and ratio-describing prompts
Joint optimization with text-guided visual map reconstruction, vision-text contrastive learning, and frequency contrastive and ranking tasks
Zijie Yue
College of Electronic and Information Engineering, Tongji University, China

Miaojing Shi
Professor at Tongji University, Visiting Senior Lecturer at King's College London (Computer Vision)

Hanli Wang
Tongji University (Multimedia Computing, Computer Vision, Image Processing, Machine Learning)

Shuai Ding
School of Management, Hefei University of Technology, China

Qijun Chen
College of Electronic and Information Engineering, Tongji University, China

Shanlin Yang
School of Management, Hefei University of Technology, China