Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

πŸ“… 2025-07-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the lack of systematic quality assessment for AI-generated talking heads (AGTHs), this paper introduces THQA-10K, the largest benchmark dataset to date, comprising 10,457 samples spanning 12 text-to-image and 14 talking-head generation models. We conduct the first comprehensive subjective-objective co-evaluation. Methodologically, we propose three objective components: (i) static quality assessment based on the first frame; (ii) dynamic consistency measurement via Y-T slice-based temporal modeling; and (iii) quantitative lip-sync (tone-lip) accuracy estimation. These components are calibrated against large-scale human perception studies. Experiments demonstrate that our approach achieves state-of-the-art performance in AGTH quality evaluation. The dataset, source code, and evaluation protocol are publicly released to establish a benchmark for digital human media quality assessment.
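To make the Y-T slice idea concrete, below is a minimal sketch (assuming OpenCV, grayscale conversion, and the centre pixel column; this is not the released code) that stacks one column per frame into a time-height image. Temporal flicker, jitter, or identity drift then shows up as irregular streaks that a 2-D quality model could score.

```python
import cv2
import numpy as np
from typing import Optional


def extract_yt_slice(video_path: str, column: Optional[int] = None) -> np.ndarray:
    """Stack one pixel column per frame into a (num_frames, height) image.

    In a Y-T slice, temporal inconsistencies appear as irregular streaks
    along the time axis, so a 2-D image-quality model can be applied to
    score dynamic consistency.
    """
    cap = cv2.VideoCapture(video_path)
    columns = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        x = gray.shape[1] // 2 if column is None else column  # assumption: centre column
        columns.append(gray[:, x])
    cap.release()
    if not columns:
        raise ValueError(f"No frames decoded from {video_path}")
    return np.stack(columns, axis=0)
```

The released method may sample several slice positions or apply a learned regressor on top of the slice; this sketch only shows how the slice itself is formed.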

πŸ“ Abstract
Speech-driven portrait animation methods are figuratively known as "Talkers" because of their ability to synthesize speaking mouth shapes and facial movements. With the rapid development of Text-to-Image (T2I) models in particular, AI-Generated Talking Heads (AGTHs) have gradually become an emerging form of digital human media. However, challenges persist regarding the quality of these talkers and of the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents THQA-10K, the largest AGTH quality assessment dataset to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation fails, the THQA-10K dataset contains 10,457 AGTHs. Volunteers are then recruited to subjectively rate the AGTHs and label the corresponding distortion categories. In our analysis of the subjective experimental results, we evaluate the performance of the talkers in terms of generalizability and quality, and also expose the distortions present in existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slices, and tone-lip consistency is proposed. Experimental results show that this method achieves state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.
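The abstract does not spell out how tone-lip consistency is computed; one simple, hedged proxy is to correlate the short-time audio energy envelope with a per-frame mouth-openness signal (assumed here to come from facial landmarks and passed in as an array). The frame alignment and the use of Pearson correlation are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np


def tone_lip_consistency(audio: np.ndarray, sample_rate: int,
                         mouth_openness: np.ndarray, fps: float) -> float:
    """Correlate the audio RMS envelope with a per-frame mouth-openness signal.

    `mouth_openness` is assumed to hold one value per video frame (e.g., the
    normalized distance between upper and lower lip landmarks). Returns a
    value in [-1, 1]; higher means audio energy and lip motion move together.
    """
    samples_per_frame = int(round(sample_rate / fps))
    n_frames = min(len(mouth_openness), len(audio) // samples_per_frame)
    mouth = np.asarray(mouth_openness[:n_frames], dtype=float)
    # Per-frame RMS energy of the audio, aligned to the video frames.
    rms = np.array([
        np.sqrt(np.mean(audio[i * samples_per_frame:(i + 1) * samples_per_frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    # Guard against constant signals, which would make the correlation undefined.
    if rms.std() < 1e-8 or mouth.std() < 1e-8:
        return 0.0
    return float(np.corrcoef(rms, mouth)[0, 1])
```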
Problem

Research questions and friction points this paper is trying to address.

Assessing the quality of AI-generated talking heads (AGTHs)
Evaluating the generalizability of talkers and the distortions in existing AGTHs
Developing an objective quality assessment method for AGTHs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest AGTH dataset to date, THQA-10K, with 10,457 samples
Subjective ratings and distortion labels collected from recruited volunteers
Objective method combining first-frame quality, Y-T slice temporal analysis, and tone-lip consistency (a minimal fusion sketch follows this list)
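As a rough illustration of the last point, the sketch below fuses the three component scores into a single AGTH quality score; the equal weights and the assumption that each component is already normalized to [0, 1] are hypothetical, not values reported by the paper.

```python
import numpy as np


def fuse_agth_score(first_frame_q: float, yt_slice_q: float,
                    tone_lip_q: float, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three component scores, each assumed in [0, 1]."""
    scores = np.array([first_frame_q, yt_slice_q, tone_lip_q], dtype=float)
    w = np.array(weights, dtype=float)
    return float(np.dot(w, scores) / w.sum())


# Hypothetical usage: three component scores for one AGTH sample.
print(fuse_agth_score(0.82, 0.64, 0.71))  # -> ~0.72
```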
Authors
Yingjie Zhou
Shanghai Jiao Tong University, PengCheng Laboratory
Jiezhang Cao
Harvard University | ETH Zürich
Image Restoration, Image Generation, Computer Vision
Zicheng Zhang
Shanghai Jiao Tong University, PengCheng Laboratory
Farong Wen
Student, Shanghai Jiao Tong University | Shanghai AI Laboratory
Intelligent Digital Human, Large Language Model, AI Evaluation
Yanwei Jiang
Shanghai Jiao Tong University, PengCheng Laboratory
Jun Jia
Shanghai Jiao Tong University, PengCheng Laboratory
Xiaohong Liu
Shanghai Jiao Tong University, PengCheng Laboratory
Xiongkuo Min
Shanghai Jiao Tong University, PengCheng Laboratory
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays