🤖 AI Summary
Existing methods for emotional editing of talking-face videos struggle to disentangle linguistic content from emotional expression and rely heavily on high-quality reference images, thereby limiting expressive flexibility and the ability to generate extended emotional states. This work proposes C-MET, a novel approach that, for the first time, enables reference-free emotional editing through cross-modal emotion semantic vectors, supporting highly expressive synthesis of both seen and unseen emotions—including nuanced states such as sarcasm. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn cross-modal emotion embeddings that capture the discrepancies between vocal and visual affective cues. Evaluated on the MEAD and CREMA-D datasets, the method achieves a 14% improvement in emotion accuracy over the current state of the art while generating natural, high-fidelity emotional talking-face videos.
📝 Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture the full range of emotions. Audio-based methods can leverage emotionally rich speech signals, and even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/
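The abstract's core idea, an emotion semantic vector defined as the difference between emotion embeddings from two modalities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the embedding dimensionality, and the random placeholder embeddings are all assumptions standing in for the real audio and facial-expression encoders.

```python
import numpy as np

def emotion_semantic_vector(audio_emb, visual_emb):
    """Cross-modal emotion semantic vector: the difference between an
    audio-derived and a visual-derived emotion embedding (assumed form)."""
    return audio_emb - visual_emb

def transfer_emotion(visual_emb, semantic_vec, strength=1.0):
    """Shift a facial-expression embedding along the emotion semantic
    vector; `strength` is a hypothetical intensity knob."""
    return visual_emb + strength * semantic_vec

rng = np.random.default_rng(0)
dim = 128  # placeholder embedding dimensionality

# Placeholder vectors standing in for real encoder outputs:
audio_emotion = rng.normal(size=dim)   # from a pretrained audio encoder
visual_emotion = rng.normal(size=dim)  # from a disentangled expression encoder

vec = emotion_semantic_vector(audio_emotion, visual_emotion)
edited = transfer_emotion(visual_emotion, vec)

# With strength == 1.0, adding the difference vector to the visual
# embedding recovers the audio-side emotion embedding exactly.
print(np.allclose(edited, audio_emotion))  # True
```

In this simplified reading, editing toward an unseen emotion amounts to moving the visual embedding along a direction supplied by the speech modality, rather than toward a discrete label or a reference image.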