Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods for emotional editing of talking-face videos struggle to disentangle linguistic content from emotional expression and rely heavily on high-quality reference images, thereby limiting expressive flexibility and the ability to generate extended emotional states. This work proposes C-MET, a novel approach that, for the first time, enables reference-free emotional editing through cross-modal emotion semantic vectors, supporting highly expressive synthesis of both seen and unseen emotions, including nuanced states such as sarcasm. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn cross-modal emotion embeddings that capture the discrepancies between vocal and visual affective cues. Evaluated on the MEAD and CREMA-D datasets, the method achieves a 14% improvement in emotion accuracy over the current state of the art while generating natural, high-fidelity emotional talking-face videos.
📝 Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/
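
The sketch below makes the abstract's core idea concrete: features from a pretrained audio encoder and a disentangled expression encoder are projected into a shared embedding space, an emotion is represented as the difference between an emotional and a neutral embedding, and an audio-derived vector is added to a neutral face embedding to edit its emotion. This is a minimal illustration under stated assumptions, not the authors' released implementation; the class, method names, and feature dimensions (`EmotionSemanticSpace`, `emotion_vector`, 768/256/128) are all hypothetical.

```python
# Minimal sketch (hypothetical, not the paper's code) of a cross-modal
# emotion semantic vector: project each modality's features into one
# shared emotion space, take the emotional-minus-neutral difference, and
# transfer that offset across modalities to edit a neutral face embedding.
import torch
import torch.nn as nn

class EmotionSemanticSpace(nn.Module):
    def __init__(self, audio_dim=768, face_dim=256, emb_dim=128):
        super().__init__()
        # Per-modality projections into the shared emotion embedding space.
        self.audio_proj = nn.Linear(audio_dim, emb_dim)
        self.face_proj = nn.Linear(face_dim, emb_dim)

    def emotion_vector(self, emotional_feat, neutral_feat, modality):
        """Emotion semantic vector: emotional minus neutral embedding."""
        proj = self.audio_proj if modality == "audio" else self.face_proj
        return proj(emotional_feat) - proj(neutral_feat)

    def edit_expression(self, neutral_face_feat, emo_vec):
        """Shift a neutral face embedding by an (audio-derived) emotion vector."""
        return self.face_proj(neutral_face_feat) + emo_vec

# Usage: derive an emotion vector from emotional vs. neutral speech features,
# then move a neutral facial-expression embedding toward that emotion.
space = EmotionSemanticSpace()
emo_speech = torch.randn(1, 768)  # emotional speech features (audio encoder)
neu_speech = torch.randn(1, 768)  # neutral speech features
neu_face = torch.randn(1, 256)    # neutral face features (expression encoder)

vec = space.emotion_vector(emo_speech, neu_speech, modality="audio")
edited = space.edit_expression(neu_face, vec)  # decoded downstream into frames
```

Representing an emotion as an embedding offset, rather than a discrete label or a reference image, is what would let such a scheme compose and interpolate emotions, including extended ones never paired with reference imagery at training time.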
Problem

Research questions and friction points this paper is trying to address.

emotion editing
talking face video
cross-modal emotion transfer
extended emotions
facial expression generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Emotion Transfer
Talking Face Generation
Emotion Editing
Disentangled Expression Encoding
Speech-to-Face Emotion Mapping
Chanhyuk Choi
Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea
Taesoo Kim
Georgia Institute of Technology
Security, Operating Systems, Systems
Donggyu Lee
Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea
Siyeol Jung
Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea
Taehwan Kim
Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea
Machine Learning, Computer Vision, Language Processing