CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

๐Ÿ“… 2026-03-17
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing speaker diarization methods struggle in open-domain scenarios such as films and TV shows, where speaker counts are large, audio-visual streams are often asynchronous, and environmental conditions are highly variable. This work proposes CineSRD, a framework that, for the first time, unifies visual, acoustic, and linguistic cues from video, speech, and subtitles. It leverages visual anchor clustering to register on-screen speakers, then integrates an audio-language model to detect speaker turns and supplement unregistered off-screen speakers. The authors also construct and publicly release the first bilingual (Chinese–English) speaker diarization benchmark for cinematic content. Extensive experiments show that CineSRD achieves state-of-the-art performance on this new dataset and remains competitive on conventional benchmarks, confirming its robustness and generalization in open-domain settings.

๐Ÿ“ Abstract
Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
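The abstract's two-stage pipeline (register speakers from visual anchors, then assign speech segments and mint new IDs for unmatched off-screen voices) can be illustrated with a minimal sketch. This is not the paper's implementation: the greedy cosine-threshold clustering, the embedding format, and all function names here are illustrative assumptions.

```python
# Hedged sketch of the two-stage idea from the abstract (not CineSRD's
# actual method): (1) cluster visual "anchor" embeddings to register
# speakers, (2) assign speech-segment embeddings to registered speakers,
# creating new IDs for unmatched (e.g., off-screen) voices.
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def register_speakers(face_embeddings, threshold=0.8):
    """Greedy single-pass clustering: each face embedding either joins an
    existing anchor (similarity >= threshold) or registers a new speaker."""
    anchors = []  # one representative embedding per registered speaker
    for emb in face_embeddings:
        if any(cosine(emb, a) >= threshold for a in anchors):
            continue  # this speaker is already registered
        anchors.append(emb)
    return anchors

def diarize(speech_embeddings, anchors, threshold=0.8):
    """Label each speech segment with a registered speaker ID, or mint a
    new ID when no anchor matches (an unregistered off-screen voice)."""
    anchors = list(anchors)
    labels = []
    for emb in speech_embeddings:
        sims = [cosine(emb, a) for a in anchors]
        if sims and max(sims) >= threshold:
            labels.append(sims.index(max(sims)))
        else:
            anchors.append(emb)          # register a new off-screen speaker
            labels.append(len(anchors) - 1)
    return labels
```

In the real system the thresholded matching would be replaced by the paper's anchor clustering and the audio-language model's turn-detection output, but the registration-then-refinement control flow is the same shape.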
Problem

Research questions and friction points this paper is trying to address.

speaker diarization
open-world
visual media
multimodal
audiovisual programs
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal speaker diarization
open-world visual media
audio-visual asynchrony
audio language model
speaker registration
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors

Liangbin Huang
Hujing Digital Media and Entertainment Group

Xiaohua Liao
Hujing Digital Media and Entertainment Group

Chaoqun Cui
Institute of Automation, Chinese Academy of Sciences
Machine Learning, Natural Language Processing

Shijing Wang
Beijing Jiaotong University
Deep Learning

Zhaolong Huang
Hujing Digital Media and Entertainment Group

Yanlong Du
Hujing Digital Media and Entertainment Group

Wenji Mao
Professor at Institute of Automation, Chinese Academy of Sciences
Artificial Intelligence, Intelligent Agents, Social Modeling and Computing