🤖 AI Summary
This work addresses a key limitation of existing talking-face generation methods, which often produce static, single-emotion expressions and fail to model natural, continuous emotional dynamics. To overcome this, we introduce a novel task, Emotionally Continuous Talking Face Generation (EC-TFG), driven jointly by input text and a dynamic emotion description, with the goal of generating realistic videos in which facial expressions align with both the speech content and the evolving affective states. We propose the first temporally dense emotion fluctuation modeling mechanism and develop a tailored architecture, Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), that incorporates temporal emotion modulation to keep facial dynamics synchronized with the textual semantics. Experimental results demonstrate that our approach consistently yields high-quality videos with smooth emotional transitions and lifelike facial movements across diverse affective conditions.
📝 Abstract
Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have achieved naturalistic facial movements, they typically express a single fixed target emotion in the synthesized video and cannot exhibit the continuously changing, natural expressions that humans show when conveying information. To synthesize such realistic videos, we propose a novel task, Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotional states as driving data and aims to generate a video in which the subject speaks the text while reflecting the emotional changes in the description. Alongside this task, we introduce a customized model, Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which handles dynamic emotional variation through Temporal-Intensive Emotion Fluctuation Modeling: it produces an emotion variation sequence aligned with the input text that drives continuous facial expression changes in the synthesized video. Extensive evaluations demonstrate our method's ability to produce smooth emotion transitions while maintaining high visual quality and motion authenticity across diverse emotional states.
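Since the abstract describes the mechanism only at a high level, here is a minimal sketch of the core idea behind driving per-frame expression changes from a coarse emotion description: densify a short sequence of emotion states into a smooth per-frame trajectory and use it to modulate frame features. All names (`TemporalEmotionModulator`, the FiLM-style scale/shift modulation, the interpolation scheme) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch) of turning a coarse emotion description into a
# dense, temporally continuous modulation signal for video frame features.
# Hypothetical names throughout; not the paper's TIE-TFG implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalEmotionModulator(nn.Module):
    """Densify K coarse emotion anchors (one embedding per described
    emotional state) into T per-frame embeddings, then apply them to
    frame features as FiLM-style scale and shift."""

    def __init__(self, emo_dim: int = 64, feat_dim: int = 256):
        super().__init__()
        # Project each dense emotion embedding to per-channel scale + shift.
        self.to_film = nn.Linear(emo_dim, 2 * feat_dim)

    def forward(self, frame_feats: torch.Tensor, emo_anchors: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, feat_dim) per-frame facial features
        # emo_anchors: (B, K, emo_dim) coarse emotion states, K << T
        B, T, _ = frame_feats.shape
        # Linearly interpolate K anchors up to T frames, yielding a smooth
        # emotion trajectory rather than abrupt per-segment switches.
        dense = F.interpolate(
            emo_anchors.transpose(1, 2), size=T, mode="linear", align_corners=True
        ).transpose(1, 2)  # (B, T, emo_dim)
        scale, shift = self.to_film(dense).chunk(2, dim=-1)
        return frame_feats * (1 + scale) + shift  # emotion-modulated features


if __name__ == "__main__":
    mod = TemporalEmotionModulator()
    feats = torch.randn(2, 100, 256)  # 100 video frames
    anchors = torch.randn(2, 3, 64)   # e.g., neutral -> happy -> sad
    print(mod(feats, anchors).shape)  # torch.Size([2, 100, 256])
```

Linear interpolation is only the simplest way to obtain a continuous trajectory; the paper's Temporal-Intensive Emotion Fluctuation Modeling presumably learns a richer text-conditioned mapping from the emotion description to the per-frame sequence.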