🤖 AI Summary
Existing speech-driven multi-speaker talking-head generation methods suffer from noticeable quality degradation, limiting visual realism and user experience. To address this, we propose EvalTalker, the first quality evaluation framework specifically designed for realistic-portrait-driven multi-speaker generation. Our contributions are threefold: (1) we introduce THQA-MT, the first large-scale, human-annotated multi-speaker quality assessment dataset; (2) we design a multi-dimensional evaluation model that integrates global quality, identity consistency, speaker-specific attribute fidelity, and audio-visual synchronization; (3) we incorporate Qwen-Sync for fine-grained multimodal synchronization modeling. Validated through comprehensive subjective studies, EvalTalker achieves strong correlation with human judgments (Spearman’s ρ > 0.92), significantly outperforming prior metrics, and establishes a reproducible, interpretable, and human-aligned benchmark for advancing high-fidelity multi-speaker talking-head generation.
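To make the correlation claim concrete, the sketch below shows the standard protocol for validating a quality metric against human judgments: computing Spearman's ρ (SRCC) and Pearson's r (PLCC) between model-predicted scores and mean opinion scores (MOS). The score arrays are hypothetical placeholders, not THQA-MT data.

```python
# Minimal sketch of correlation-based metric validation, assuming
# hypothetical predicted scores and human MOS (not data from the paper).
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Hypothetical model-predicted quality scores for a batch of generated videos.
predicted = np.array([3.8, 2.1, 4.5, 3.0, 1.7, 4.2])
# Hypothetical mean opinion scores from a subjective study on the same videos.
mos = np.array([3.5, 2.4, 4.7, 3.1, 1.5, 4.0])

srcc, _ = spearmanr(predicted, mos)  # rank-order agreement (SRCC)
plcc, _ = pearsonr(predicted, mos)   # linear agreement (PLCC)
print(f"SRCC = {srcc:.3f}, PLCC = {plcc:.3f}")
```

A higher SRCC means the metric ranks videos in the same order as human raters, which is the property the reported ρ > 0.92 describes.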
📝 Abstract
Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in driving multiple subjects. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, brings richer interactivity and stronger immersion to audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) produced by 15 representative Multi-Talkers from 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among the Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework that jointly perceives global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to capture multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
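As a rough illustration of the multi-dimensional design, the sketch below fuses per-dimension scores into a single overall prediction. The dimension names mirror the abstract, but the weights, the fusion rule, and the score values are hypothetical, not the authors' implementation (a real system such as EvalTalker would learn this fusion end to end).

```python
# Hypothetical sketch of fusing per-dimension quality scores into one
# overall prediction. Dimension names follow the abstract; weights and
# values are illustrative placeholders only.
from typing import Dict

def fuse_scores(dims: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension quality scores in [0, 1]."""
    total_w = sum(weights[k] for k in dims)
    return sum(dims[k] * weights[k] for k in dims) / total_w

# Assumed outputs of separate perception branches for one generated video.
scores = {
    "global_quality": 0.82,
    "identity_consistency": 0.74,
    "human_characteristics": 0.69,
    "av_synchrony": 0.91,  # e.g. a synchrony score from a module like Qwen-Sync
}
# Hypothetical weights emphasizing audio-visual synchrony.
weights = {k: 1.0 for k in scores}
weights["av_synchrony"] = 1.5

print(f"overall quality = {fuse_scores(scores, weights):.3f}")
```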