🤖 AI Summary
Existing talking-head generation methods predominantly produce unidirectional animations, and the few bidirectional interactive approaches lack fine-grained emotional modeling, limiting both perceptual realism and practical utility. To address this, we propose the first emotion-aware framework for two-party interactive talking-head generation. Our method introduces: (1) an emotion-aware dialogue tree built from conversational history to dynamically switch between listening and speaking states; (2) a temporally consistent head-mask generation mechanism in latent space, integrating a Transformer-based mask generator with an LLM-driven dialogue engine; and (3) a hierarchical-traversal emotion-guidance strategy coupled with an emotion-adaptive facial expression controller. Extensive experiments demonstrate significant improvements over state-of-the-art methods in three critical dimensions: long-sequence coherence, fidelity under high emotional density, and temporal continuity. The framework substantially enhances the naturalness and emotional expressiveness of virtual agent interactions.
📝 Abstract
Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character's emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
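The interactive talking tree described above can be illustrated with a minimal sketch. All names here (`DialogueNode`, `reverse_level_emotions`) are hypothetical and purely illustrative, not the paper's implementation: each node stores a speaker, an emotion label, and parent/child links, and a traversal from the current node back toward the root recovers the historical emotional cues in chronological order, which is one plausible reading of the "reverse-level traversal" the abstract mentions.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueNode:
    # Hypothetical node layout; field names are illustrative,
    # not the authors' actual data structure.
    speaker: str       # which party is active, e.g. "A" or "B"
    emotion: str       # emotional state at this turn, e.g. "happy"
    utterance: str = ""
    parent: "DialogueNode | None" = None
    children: list = field(default_factory=list)

    def add_child(self, child: "DialogueNode") -> "DialogueNode":
        # Adding a child models the next dialogue turn (state switch
        # between speaking and listening).
        child.parent = self
        self.children.append(child)
        return child

def reverse_level_emotions(node: DialogueNode) -> list:
    """Walk from the current node up to the root, then reverse,
    yielding per-turn emotions in chronological order as historical
    cues to guide expression synthesis."""
    path = []
    while node is not None:
        path.append(node.emotion)
        node = node.parent
    return list(reversed(path))

# Tiny three-turn dialogue: A greets (happy) -> B replies (neutral)
# -> A answers (excited).
root = DialogueNode("A", "happy", "Hi!")
turn2 = root.add_child(DialogueNode("B", "neutral", "Hello."))
turn3 = turn2.add_child(DialogueNode("A", "excited", "Great news!"))

print(reverse_level_emotions(turn3))  # ['happy', 'neutral', 'excited']
```

In a full system, the emotion sequence returned here would condition the expression controller for the current speaker, while an LLM-driven dialogue engine decides the next node to append.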