🤖 AI Summary
Existing speech-driven multi-speaker talking-head generation methods suffer from noticeable quality degradation, limiting visual realism and user experience. To address this, we propose EvalTalker, the first quality evaluation framework specifically designed for realistic-portrait-driven multi-speaker generation. Our contributions are threefold: (1) we introduce THQA-MT, the first large-scale, human-annotated multi-speaker quality assessment dataset; (2) we design a multi-dimensional evaluation model that integrates global quality, identity consistency, speaker-specific attribute fidelity, and audio-visual synchronization; (3) we incorporate Qwen-Sync for fine-grained multimodal synchronization modeling. Validated through comprehensive subjective studies, EvalTalker achieves strong correlation with human judgments (Spearman’s ρ > 0.92), significantly outperforming prior metrics, and establishes a reproducible, interpretable, and human-aligned benchmark for advancing high-fidelity multi-speaker talking-head generation.
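To make the correlation claim concrete, the sketch below shows the standard protocol for validating a quality metric against human judgments: computing Spearman's ρ (SRCC) and Pearson's r (PLCC) between model-predicted scores and mean opinion scores (MOS). The score arrays are hypothetical placeholders, not THQA-MT data.

```python
# Minimal sketch of correlation-based metric validation, assuming
# hypothetical predicted scores and human MOS (not data from the paper).
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Hypothetical model-predicted quality scores for a batch of generated videos.
predicted = np.array([3.8, 2.1, 4.5, 3.0, 1.7, 4.2])
# Hypothetical mean opinion scores from a subjective study on the same videos.
mos = np.array([3.5, 2.4, 4.7, 3.1, 1.5, 4.0])

srcc, _ = spearmanr(predicted, mos)  # rank-order agreement (SRCC)
plcc, _ = pearsonr(predicted, mos)   # linear agreement (PLCC)
print(f"SRCC = {srcc:.3f}, PLCC = {plcc:.3f}")
```

A higher SRCC means the metric ranks videos in the same order as human raters, which is the property the reported ρ > 0.92 describes.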
📝 Abstract
Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in driving multiple subjects. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, brings richer interactivity and stronger immersion to audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) produced by 15 representative Multi-Talkers from 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among the Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework that jointly perceives global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to capture multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
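As a rough illustration of the multi-dimensional design, the sketch below fuses per-dimension scores into a single overall prediction. The dimension names mirror the abstract, but the weights, the fusion rule, and the score values are hypothetical, not the authors' implementation (a real system such as EvalTalker would learn this fusion end to end).

```python
# Hypothetical sketch of fusing per-dimension quality scores into one
# overall prediction. Dimension names follow the abstract; weights and
# values are illustrative placeholders only.
from typing import Dict

def fuse_scores(dims: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension quality scores in [0, 1]."""
    total_w = sum(weights[k] for k in dims)
    return sum(dims[k] * weights[k] for k in dims) / total_w

# Assumed outputs of separate perception branches for one generated video.
scores = {
    "global_quality": 0.82,
    "identity_consistency": 0.74,
    "human_characteristics": 0.69,
    "av_synchrony": 0.91,  # e.g. a synchrony score from a module like Qwen-Sync
}
# Hypothetical weights emphasizing audio-visual synchrony.
weights = {k: 1.0 for k in scores}
weights["av_synchrony"] = 1.5

print(f"overall quality = {fuse_scores(scores, weights):.3f}")
```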