Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation

📅 2025-02-04
🏛️ International Journal of Computer Vision
📈 Citations: 2
Influential: 0
🤖 AI Summary
In speech-preserving facial expression manipulation (SPFEM), the core challenge is disentangling the tightly coupled speech-content and emotion representations of a talking face while preserving lip-motion consistency and expression authenticity. This paper proposes a Contrastive Decoupled Representation Learning (CDRL) framework: (1) a Contrastive Content Representation Learning (CCRL) module uses audio features as content priors to model speech-driven lip dynamics; (2) a Contrastive Emotion Representation Learning (CERL) module leverages a pre-trained visual-language model to extract content-independent emotion representations; and (3) emotion-aware and emotion-augmented contrastive objectives keep the two representations decoupled. During SPFEM training, the decoupled representations supervise the generator, improving expression realism and audio–lip synchronization across multiple benchmarks and outperforming state-of-the-art methods both quantitatively and qualitatively.

📝 Abstract
Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of the source spoken contents. Thus, the emotion and content information in the reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance, augmented with contrastive learning, to learn decoupled content and emotion representations via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio features, which primarily contain content information, as content priors to guide learning the content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn an emotion prior, which is then used to guide learning the emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train the CCRL and CERL modules, respectively, ensuring an emotion-independent content representation and a content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio–lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.
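The contrastive learning the abstract describes is typically built on an InfoNCE-style objective: an anchor representation (e.g. a content feature from a video frame) is pulled toward a positive (e.g. the paired audio content prior) and pushed away from negatives (mismatched samples). The sketch below is a minimal, assumption-laden illustration of that objective in NumPy, not the paper's actual implementation; all variable names and the cosine-similarity/temperature choices are illustrative.

```python
import numpy as np

def info_nce(anchors, positives, negatives, tau=0.1):
    """Minimal InfoNCE contrastive loss (illustrative sketch).

    anchors:   (B, D) e.g. content features from source frames
    positives: (B, D) e.g. paired audio content priors (one per anchor)
    negatives: (K, D) mismatched samples shared across the batch
    tau:       temperature scaling the cosine similarities
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = l2norm(anchors), l2norm(positives), l2norm(negatives)
    pos_sim = np.sum(a * p, axis=-1) / tau        # (B,)  anchor-positive similarity
    neg_sim = (a @ n.T) / tau                     # (B, K) anchor-negative similarities
    logits = np.concatenate([pos_sim[:, None], neg_sim], axis=1)
    # Cross-entropy with the positive as the "correct class":
    # -log softmax(positive logit), averaged over the batch.
    log_prob = pos_sim - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

# Toy check: aligned anchor/positive pairs give near-zero loss,
# swapped pairs (positive is actually a negative) give a large loss.
a = np.eye(2)
loss_aligned = info_nce(a, a.copy(), -a)   # positives match anchors
loss_swapped = info_nce(a, -a, a.copy())   # positives are the negatives
```

Under this framing, the emotion-aware variant for CCRL would choose negatives that share content but differ in emotion (so the content feature cannot lean on emotion cues), while the emotion-augmented variant for CERL would do the converse; those sampling strategies are the paper's contribution and are not reproduced here.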
Problem

Research questions and friction points this paper is trying to address.

Decoupling emotion and content in talking head videos
Preserving speech while changing facial expressions
Using contrastive learning for accurate emotion manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Decoupled Representation Learning algorithm
Contrastive Content Representation Learning module
Contrastive Emotion Representation Learning module
Tianshui Chen
Guangdong University of Technology, Guangzhou, China
Jianman Lin
South China University of Technology
Computer vision, image generation/editing, indoor scene generation
Zhijing Yang
Guangdong University of Technology, Guangzhou, China
Chumei Qing
South China University of Technology, Guangzhou, China
Yukai Shi
Guangdong University of Technology, Guangzhou, China
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis