🤖 AI Summary
State-of-the-art active speaker detection (ASD) models perform well on curated benchmarks like AVA but generalize poorly to real-world scenarios marked by multilingual speech, heavy acoustic noise, and overlapping speakers.
Method: We introduce UniTalk, the first real-world-oriented universal ASD benchmark, comprising 44.5 hours of video with frame-level annotations across 48,693 speaking identities. It systematically models practical challenges and mitigates domain shift. We propose a multimodal temporal alignment architecture that fuses visual and audio features, coupled with a robust labeling strategy that enables end-to-end training and cross-domain evaluation.
Results: SOTA models suffer substantial performance degradation on UniTalk, confirming its rigor. Conversely, models trained on UniTalk achieve superior generalization across diverse benchmarks, including Talkies, ASW, and AVA, establishing a new evaluation paradigm and a practical, realistic standard for ASD.
📝 Abstract
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA.

Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD
Code: https://github.com/plnguyen2908/UniTalk-ASD-code
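ASD benchmarks like AVA are conventionally scored with mean average precision (mAP) over frame-level "speaking" scores assigned to each visible face. As a rough illustration of that metric (not the UniTalk evaluation code; the function name and toy data are ours), a minimal sketch:

```python
def average_precision(labels, scores):
    """Average precision over per-face frame-level predictions.

    labels: 1 if the face is an active speaker in that frame, else 0.
    scores: the model's confidence that the face is speaking.
    """
    # Rank predictions from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            precisions.append(tp / rank)  # precision at each recall step
    return sum(precisions) / max(tp, 1)

# Toy example: four face crops, two of which are truly speaking.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
ap = average_precision(labels, scores)  # one false positive outranks a true one
```

Benchmark mAP is then this quantity averaged appropriately over the evaluation set; near-saturated mAP on AVA versus much lower mAP on UniTalk is what motivates the harder benchmark.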