RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing methods for generating virtual conversational avatars struggle to simultaneously achieve high realism, support multi-turn dialogue, and model social relationships. This work proposes a novel framework that integrates 3D Gaussian Splatting (3DGS) with mesh-driven facial animation, introducing a three-stage training paradigm to enable, for the first time, 3DGS-based generation of multi-character, socially aware talking avatars. We incorporate a learnable query mechanism to explicitly encode kinship/non-kinship and egalitarian/hierarchical social relations, and construct RSATalker, the first speech-mesh-image triplet dataset annotated with social relationship labels. Experiments demonstrate that our approach achieves state-of-the-art performance in both visual realism and social perception, enabling efficient rendering of high-quality avatars capable of engaging in multi-turn interactive conversations.
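The learnable query mechanism mentioned in this summary could look roughly like the sketch below. It rests on assumptions the summary does not confirm: that the two binary relation axes select one of four query vectors, and that the selected query pools per-frame speech features with single-head scaled dot-product attention. The names `relation_index`, `query_table`, and `attend` are hypothetical, and the table is randomly initialized here where a real system would learn it.

```python
import math
import random

DIM = 4  # embedding width, chosen arbitrarily for this sketch


def relation_index(blood: bool, equal: bool) -> int:
    """Map the two binary relation axes to one of four table rows."""
    return 2 * int(blood) + int(equal)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Pool feature vectors with one query (scaled dot-product attention)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    return [
        sum(w * feat[k] for w, feat in zip(weights, features))
        for k in range(len(query))
    ]


random.seed(0)
# Hypothetical query table: one vector per relation; learned in practice.
query_table = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(4)]

# Toy stand-in for speech features: three frames of DIM-dimensional vectors.
speech_features = [[0.1 * (t + k) for k in range(DIM)] for t in range(3)]

# Select the query for a "blood, unequal" relation and pool the frames with it.
query = query_table[relation_index(blood=True, equal=False)]
social_embedding = attend(query, speech_features)
```

The resulting `social_embedding` is a weighted average of the frames, so it stays in the same space as the speech features while depending on which relation was selected.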

📝 Abstract
Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recent 3D Gaussian Splatting (3DGS)-based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships along two axes, blood versus non-blood and equal versus unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset of speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
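To make the idea of binding 3D Gaussians to mesh facets concrete, here is a minimal sketch of one common way to do it. It is an assumption for illustration, not the formulation confirmed by the abstract: each Gaussian stores its center in a triangle's local frame (origin at the centroid, axes from one edge and the face normal, scaled by that edge's length), so the Gaussian follows the triangle when speech-driven motion moves the mesh. The helper names `triangle_frame` and `gaussian_center` are hypothetical.

```python
import math


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def _normalize(a):
    # Assumes a non-zero vector, i.e. a non-degenerate triangle.
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]


def triangle_frame(v0, v1, v2):
    """Return (origin, axes, scale) of a local frame attached to a triangle."""
    origin = [(v0[i] + v1[i] + v2[i]) / 3.0 for i in range(3)]
    edge = _sub(v1, v0)
    e1 = _normalize(edge)
    normal = _normalize(_cross(edge, _sub(v2, v0)))
    e2 = _cross(normal, e1)  # completes a right-handed orthonormal basis
    scale = math.sqrt(sum(x * x for x in edge))
    return origin, (e1, e2, normal), scale


def gaussian_center(local_pos, triangle):
    """Map a Gaussian center from triangle-local to world coordinates."""
    origin, axes, scale = triangle_frame(*triangle)
    return [
        origin[k] + scale * sum(local_pos[j] * axes[j][k] for j in range(3))
        for k in range(3)
    ]


# A triangle at rest, and the same triangle after the mesh moved it up by 2.
rest = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
moved = tuple([v[0], v[1], v[2] + 2.0] for v in rest)

local = [0.0, 0.0, 0.1]  # Gaussian floats slightly above the face
center_rest = gaussian_center(local, rest)
center_moved = gaussian_center(local, moved)
```

Because only the triangle vertices change between frames, the stored local position stays fixed and the world-space center is recomputed from the deformed mesh at render time.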
Problem

Research questions and friction points this paper is trying to address.

talking head generation
multi-turn conversation
social awareness
3D Gaussian Splatting
virtual reality
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting
socially-aware generation
talking head
multi-turn conversation
facial animation
Peng Chen
Institute of Software, Chinese Academy of Sciences
MLLM, AIGC, 3D Vision
Xiaobao Wei
Institute of Software, Chinese Academy of Sciences
3D Vision
Yi Yang
Institute of Software, Chinese Academy of Sciences
Naiming Yao
Institute of Software, Chinese Academy of Sciences
Hui Chen
Institute of Software, Chinese Academy of Sciences
Feng Tian
Institute of Software, Chinese Academy of Sciences