Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement

📅 2024-07-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unsupervised remote photoplethysmography (rPPG) for physiological measurement, i.e., estimating heart and respiratory rates without synchronized ground-truth PPG labels. The authors propose the first self-supervised rPPG framework built on vision-language models (VLMs). Methodologically, they introduce frequency-aware video augmentation, spatio-temporal map and text pair construction, and multi-task collaborative optimization to align VLMs with physiological frequency knowledge across modalities. Key components include frequency-guided vision-text contrastive learning, text-guided visual map reconstruction, and a frequency contrastive and ranking task. Evaluated on four benchmark datasets, the approach significantly outperforms existing self-supervised rPPG methods in both heart rate and respiratory rate estimation, demonstrating superior accuracy, robustness, and generalization. The results validate the effectiveness of incorporating physiological frequency priors into VLM-based modeling.

📝 Abstract
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To address this, self-supervised learning has recently gained attention; however, its performance is limited due to the lack of ground-truth PPG signals. In this paper, we propose a novel self-supervised framework that successfully integrates popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing text prompts that describe their relative signal-frequency ratios. A pre-trained VLM is employed to extract features from these vision-text pairs and subsequently estimate rPPG signals. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including a text-guided visual map reconstruction task, a vision-text contrastive learning task, and a frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align frequency-related knowledge across the vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
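The abstract's first step, augmenting positive and negative samples with varying rPPG frequencies, can be sketched as follows. The idea is that temporally resampling a face video by a factor scales its underlying physiological frequency by the same factor, yielding label-free pairs plus a text prompt describing the relative ratio. This is a minimal illustrative sketch, not the paper's implementation; all function names and the prompt wording are assumptions.

```python
import numpy as np

def resample_video(frames: np.ndarray, speed: float) -> np.ndarray:
    """Temporally resample a video (frames along axis 0) by `speed` using
    nearest-frame indexing. Playing the video at `speed`x scales its rPPG
    frequency by the same factor (illustrative approximation)."""
    t = frames.shape[0]
    idx = np.clip(np.round(np.arange(t) * speed), 0, t - 1).astype(int)
    return frames[idx]

def make_frequency_pairs(frames: np.ndarray,
                         neg_speeds=(0.6, 0.8, 1.25, 1.5)):
    """Build one positive (same frequency) and several negatives whose
    frequencies differ by known ratios -- the ratios are the supervision."""
    positive = resample_video(frames, 1.0)
    negatives = [resample_video(frames, s) for s in neg_speeds]
    ratios = [1.0] + list(neg_speeds)
    return positive, negatives, ratios

def frequency_prompt(ratio: float) -> str:
    """Hypothetical text prompt describing a relative frequency ratio,
    in the spirit of the paper's frequency-oriented vision-text pairs."""
    return (f"the physiological signal frequency of this video is "
            f"{ratio:.2f} times that of the anchor video")
```

The known ratios make contrastive and ranking objectives possible without ground-truth PPG: samples with ratio 1.0 are pulled together, others pushed apart and ordered by ratio.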
Problem

Research questions and friction points this paper is trying to address.

Estimating vital signs from facial video without synchronized ground-truth PPG labels
Limited performance of existing self-supervised rPPG methods
Aligning frequency-related physiological knowledge across vision and text modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

First integration of pre-trained VLMs into self-supervised remote physiological measurement
Frequency-oriented vision-text pair generation via contrastive spatio-temporal maps and ratio-describing prompts
Joint optimization with text-guided visual map reconstruction, vision-text contrastive learning, and frequency contrastive and ranking tasks
Zijie Yue
College of Electronic and Information Engineering, Tongji University, China

Miaojing Shi
Professor at Tongji University, Visiting Senior Lecturer at King's College London (Computer Vision)

Hanli Wang
Tongji University (Multimedia Computing, Computer Vision, Image Processing, Machine Learning)

Shuai Ding
School of Management, Hefei University of Technology, China

Qijun Chen
College of Electronic and Information Engineering, Tongji University, China

Shanlin Yang
School of Management, Hefei University of Technology, China