🤖 AI Summary
Modeling intra-personal facial dynamics is challenging due to the scarcity of real-world temporal face image sequences. Method: We propose a transferable facial representation learning framework trained exclusively on synthetic data. The approach decouples inter-personal variation from intra-personal temporal dynamics via a dual-branch contrastive learning architecture, integrating cross-task feature transfer and unsupervised feature disentanglement to jointly model expression-, weight-, and age-related facial changes. Contribution/Results: The method requires no real-world temporal annotations, yet achieves performance comparable or superior to state-of-the-art (SOTA) methods on three downstream tasks, demonstrating for the first time that synthetic data alone can suffice for robust, generalizable temporal facial representation learning. This suggests a new paradigm for applications in clinical monitoring and affective computing.
📝 Abstract
Daily monitoring of intra-personal facial changes associated with health and emotional conditions has great potential for medical, healthcare, and emotion recognition applications. However, capturing intra-personal facial changes remains relatively unexplored because collecting temporally changing face images of the same individual is difficult. In this paper, we propose a facial representation learning method that uses synthetic images for comparing faces, called ComFace, which is designed to capture intra-personal facial changes. For effective representation learning, ComFace aims to acquire two feature representations: inter-personal facial differences and intra-personal facial changes. The key point of our method is the use of synthetic face images to overcome the limitations of collecting real intra-personal face images. The facial representations learned by ComFace are transferred to three downstream tasks for comparing faces: estimating facial expression changes, weight changes, and age changes from two face images of the same individual. ComFace, trained using only synthetic data, achieves transfer performance comparable to or better than general pretraining and state-of-the-art representation learning methods trained on real images.
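To make the two target representations concrete, here is a minimal toy sketch of the idea: an inter-personal contrastive term that separates identities, and an intra-personal change vector between two images of the same person that downstream heads could regress expression, weight, or age change from. All names are hypothetical, and the linear "encoder" is a stand-in for the real network; this is an illustration of the concept, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img, W):
    # Hypothetical stand-in for the shared encoder:
    # a linear projection followed by L2 normalization.
    z = img @ W
    return z / np.linalg.norm(z)

def inter_personal_loss(z_a, z_b, same_identity):
    # Contrastive-style term: pull embeddings of the same person
    # together (loss -> 0 as similarity -> 1), push different
    # identities apart (penalize positive similarity).
    sim = float(z_a @ z_b)
    return (1.0 - sim) if same_identity else max(0.0, sim)

def intra_personal_change(z_t0, z_t1):
    # Change vector between two images of the SAME individual;
    # downstream heads would regress expression/weight/age change
    # from this vector.
    return z_t1 - z_t0

D, K = 32, 8                                   # toy pixel and embedding dims
W = rng.normal(size=(D, K))
face_t0 = rng.normal(size=D)                   # person A, time t0
face_t1 = face_t0 + 0.1 * rng.normal(size=D)   # person A, slightly changed
other   = rng.normal(size=D)                   # a different person

z0, z1, zo = encode(face_t0, W), encode(face_t1, W), encode(other, W)
delta = intra_personal_change(z0, z1)          # input to a downstream head
```

In this toy setup, the same-person pair yields a much smaller inter-personal loss than the cross-identity pair, while `delta` is the quantity the comparison tasks would consume.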