VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement

๐Ÿ“… 2025-01-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing remote photoplethysmography (rPPG) methods exhibit unstable performance across datasets of varying scales, primarily due to the limited representational capacity of standalone CNNs or Transformers, absence of physiological priors, and poor generalization under low-data regimes. To address these limitations, we propose the first end-to-end 3D-CNNโ€“Transformer collaborative framework for spatiotemporal hemodynamic modeling from facial videos. Our approach features: (1) an enhanced skin reflectance physical model to enforce physiologically consistent signal reconstruction; (2) a dual-path spatiotemporal attention mechanism that explicitly decouples motion artifacts from hemodynamic responses; and (3) a cross-modal feature fusion module to improve robustness. Evaluated on five public benchmarks, our method achieves new state-of-the-art performance, reducing heart rate estimation MAE by 18.7% over prior works, while demonstrating significantly improved robustness to skin tone variation, cosmetic occlusion, and head motion.

Technology Category

Application Category

๐Ÿ“ Abstract
Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformer. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates 3-Dimension Convolutional Neural Network (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an analysis of the traditional skin reflection model and subsequently introduce an enhanced model for the reconstruction of rPPG signals. Based on this improved model, VidFormer utilizes 3DCNN and Transformer to extract local and global features from input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored for both 3DCNN and Transformer. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential roles of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.
Problem

Research questions and friction points this paper is trying to address.

Deep Learning
rPPG
Data Scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

3DCNN-Transformer Fusion
Temporal Spatial Attention Mechanism
Information Interaction Module
๐Ÿ”Ž Similar Papers
No similar papers found.
J
Jiachen Li
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731 China
Shisheng Guo
Shisheng Guo
University of Electronic Science and Technology of China
UWB radar imagingSensor networkOptimization theory
L
Longzhen Tang
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731 China
C
Cuolong Cui
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731 China; Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou 324000, China
L
Lingjiang Kong
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731 China
X
Xiaobo Yang
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731 China