🤖 AI Summary
Neural speaker diarization suffers from a strong reliance on pre-specified speaker counts and from insufficient modeling of local temporal dynamics. To address these issues, this paper proposes an end-to-end trainable framework with three key innovations: (1) a speaker-attribute attractor paradigm that explicitly models fine-grained acoustic attributes via multi-stage intermediate representations; (2) the integration of the Conformer architecture into the diarization backbone to strengthen modeling of local temporal dependencies in speech; and (3) a joint design of speaker-attribute embeddings with differentiable clustering, removing the need for a fixed speaker-count assumption. Evaluated on the CALLHOME dataset, the method achieves a significant DER reduction and demonstrates better robustness and generalization than both EDA and pure-Transformer baselines. The results validate the effectiveness of combining attribute-driven attractors with enhanced local modeling for speaker diarization.
📝 Abstract
In recent years, end-to-end approaches have made notable progress on speaker diarization, the task of segmenting multi-talker recordings and identifying who spoke when. One such approach, Encoder-Decoder Attractors (EDA), was proposed to handle variable numbers of speakers and to better guide the network during training. In this study, we extend the attractor paradigm by moving beyond direct speaker modeling and instead representing more detailed ‘speaker attributes’ through a multi-stage process of intermediate representations. Additionally, we enhance the architecture by replacing Transformers with Conformers, a convolution-augmented Transformer variant, to better model local dependencies. Experiments demonstrate improved diarization performance on the CALLHOME dataset.
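One part of the EDA mechanism the abstract builds on can be illustrated concretely: attractors are emitted sequentially, each with a learned existence probability, and decoding stops at the first attractor whose probability falls below a threshold, so the speaker count need not be fixed in advance. Below is a minimal sketch of that stopping rule in plain Python; the function name and the threshold value 0.5 are illustrative assumptions, not details taken from the paper.

```python
def count_speakers(existence_probs, threshold=0.5):
    """Infer the number of speakers from attractor existence probabilities.

    EDA-style decoding emits attractors one at a time; the first attractor
    whose existence probability drops below the threshold acts as a stop
    signal, and the attractors before it define the active speakers.
    (Illustrative sketch; threshold=0.5 is an assumed value.)
    """
    count = 0
    for p in existence_probs:
        if p < threshold:
            break  # stop signal: no further attractors are kept
        count += 1
    return count

# Three confident attractors followed by low-probability stop candidates:
print(count_speakers([0.97, 0.91, 0.82, 0.12, 0.03]))  # → 3
```

In the full model these probabilities come from a decoder over the frame-level embeddings, and each retained attractor is compared against every frame to produce per-speaker activity posteriors; the sketch only shows how a variable speaker count falls out of the thresholding step.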