🤖 AI Summary
Existing talking-head generation methods predominantly produce unidirectional animations, and the few bidirectional interactive approaches lack fine-grained emotional modeling, limiting both perceptual realism and practical utility. To address this, we propose the first emotion-aware framework for two-party interactive talking-head generation. Our method introduces: (1) an emotion-aware dialogue tree built from conversational history to dynamically switch between listening and speaking states; (2) a temporally consistent head-mask generation mechanism in latent space, integrating a Transformer-based mask generator with an LLM-driven dialogue engine; and (3) a hierarchical-traversal emotion-guidance strategy coupled with an emotion-adaptive facial expression controller. Extensive experiments demonstrate significant improvements over state-of-the-art methods in three critical dimensions: long-sequence coherence, fidelity under high emotional density, and temporal continuity. The framework substantially enhances the naturalness and emotional expressiveness of virtual agent interactions.
📝 Abstract
Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character's emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
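The interactive talking tree described above can be illustrated with a minimal sketch. All names here (`DialogueNode`, `reverse_level_emotions`) are hypothetical and purely illustrative, not the paper's implementation: each node stores a speaker, an emotion label, and parent/child links, and a traversal from the current node back toward the root recovers the historical emotional cues in chronological order, which is one plausible reading of the "reverse-level traversal" the abstract mentions.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueNode:
    # Hypothetical node layout; field names are illustrative,
    # not the authors' actual data structure.
    speaker: str       # which party is active, e.g. "A" or "B"
    emotion: str       # emotional state at this turn, e.g. "happy"
    utterance: str = ""
    parent: "DialogueNode | None" = None
    children: list = field(default_factory=list)

    def add_child(self, child: "DialogueNode") -> "DialogueNode":
        # Adding a child models the next dialogue turn (state switch
        # between speaking and listening).
        child.parent = self
        self.children.append(child)
        return child

def reverse_level_emotions(node: DialogueNode) -> list:
    """Walk from the current node up to the root, then reverse,
    yielding per-turn emotions in chronological order as historical
    cues to guide expression synthesis."""
    path = []
    while node is not None:
        path.append(node.emotion)
        node = node.parent
    return list(reversed(path))

# Tiny three-turn dialogue: A greets (happy) -> B replies (neutral)
# -> A answers (excited).
root = DialogueNode("A", "happy", "Hi!")
turn2 = root.add_child(DialogueNode("B", "neutral", "Hello."))
turn3 = turn2.add_child(DialogueNode("A", "excited", "Great news!"))

print(reverse_level_emotions(turn3))  # ['happy', 'neutral', 'excited']
```

In a full system, the emotion sequence returned here would condition the expression controller for the current speaker, while an LLM-driven dialogue engine decides the next node to append.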