RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for generating conversational avatars struggle to combine high realism, support for multi-turn dialogue, and modeling of social relationships. This work proposes a framework that integrates 3D Gaussian Splatting (3DGS) with mesh-driven facial animation and uses a three-stage training paradigm to enable, for the first time, 3DGS-based generation of socially aware talking heads in multi-turn conversation. A learnable query mechanism explicitly encodes two kinds of social relation: kinship (blood versus non-blood) and status (equal versus unequal). The authors also construct the RSATalker dataset, a set of speech-mesh-image triplets annotated with social relationship labels. Experiments show state-of-the-art performance in both visual realism and social awareness, with efficient rendering of high-quality avatars that can take part in multi-turn conversations.
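The summary names a learnable query mechanism but does not describe its internals. As a rough illustration only, the sketch below shows one common way such a mechanism can work: a small set of query vectors is conditioned on a relationship label and then attends over speech features. Everything here is an assumption rather than the authors' design: the `RELATIONS` list, the function `social_embedding`, the additive conditioning, the single attention step, and the random arrays standing in for learned parameters.

```python
import numpy as np

# Hypothetical label set built from the two relationship axes in the summary.
RELATIONS = [
    ("blood", "equal"),
    ("blood", "unequal"),
    ("non-blood", "equal"),
    ("non-blood", "unequal"),
]


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def social_embedding(queries, label_table, relation, speech_features):
    """Assumed design: label-conditioned queries attend over speech frames.

    queries:         (Q, D) query vectors (learned in a real model)
    label_table:     (len(RELATIONS), D) one vector per relationship label
    relation:        a tuple from RELATIONS
    speech_features: (T, D) per-frame speech features
    Returns a (Q, D) embedding and the (Q, T) attention weights.
    """
    idx = RELATIONS.index(relation)
    conditioned = queries + label_table[idx]          # inject the relationship label
    scores = conditioned @ speech_features.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)                # each query's weights sum to 1
    return weights @ speech_features, weights


# Random stand-ins for parameters that would normally be trained.
rng = np.random.default_rng(0)
num_queries, dim, num_frames = 4, 16, 50
queries = rng.normal(size=(num_queries, dim))
label_table = rng.normal(size=(len(RELATIONS), dim))
speech = rng.normal(size=(num_frames, dim))

embedding, weights = social_embedding(queries, label_table, ("blood", "unequal"), speech)
print(embedding.shape, weights.shape)
```

Changing the `relation` argument changes which frames the queries attend to, which is the sense in which the resulting embedding carries the social relationship.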

📝 Abstract

Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recent methods based on 3D Gaussian Splatting (3DGS) achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that uses a learnable query mechanism to encode social relationships (blood versus non-blood, and equal versus unequal) into high-level embeddings. We design a three-stage training paradigm and construct the RSATalker dataset, which consists of speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
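To make the step of binding 3D Gaussians to mesh facets more concrete, here is a minimal sketch of one common formulation: each Gaussian stores an offset in a triangle's local frame, so it follows the triangle as the mesh animates. The abstract does not give the exact parameterization, so the frame construction, the area-based scale, and the names `triangle_frame` and `gaussian_center` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def triangle_frame(v0, v1, v2):
    """Build a local frame for one mesh triangle (assumed convention).

    Returns the centroid, a 3x3 rotation whose columns are the local axes,
    and an isotropic scale derived from the triangle's area.
    """
    origin = (v0 + v1 + v2) / 3.0
    edge = v1 - v0
    normal = np.cross(edge, v2 - v0)          # length equals twice the area
    x_axis = edge / np.linalg.norm(edge)
    z_axis = normal / np.linalg.norm(normal)
    y_axis = np.cross(z_axis, x_axis)         # completes a right-handed frame
    rotation = np.stack([x_axis, y_axis, z_axis], axis=1)
    scale = np.sqrt(np.linalg.norm(normal) / 2.0)
    return origin, rotation, scale


def gaussian_center(local_offset, v0, v1, v2):
    """Map a Gaussian's local offset to world space for the current triangle pose."""
    origin, rotation, scale = triangle_frame(v0, v1, v2)
    return origin + scale * (rotation @ local_offset)


# One triangle in a rest pose and the same triangle after the mesh moves.
rest = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
shift = np.array([0.2, -0.1, 0.5])
moved = [v + shift for v in rest]

offset = np.array([0.0, 0.0, 0.1])            # Gaussian sits slightly above the facet
center_rest = gaussian_center(offset, *rest)
center_moved = gaussian_center(offset, *moved)
print(center_rest, center_moved)
```

Because the offset is stored relative to the facet, speech-driven mesh motion moves the Gaussian without changing its stored parameters; in this example a pure translation of the triangle translates the Gaussian center by the same amount.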

Problem

Research questions and friction points this paper is trying to address.

talking head generation

multi-turn conversation

social awareness

3D Gaussian Splatting

virtual reality

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting

socially-aware generation

talking head