🤖 AI Summary
Neural speaker diarization suffers from a strong reliance on pre-specified speaker counts and from insufficient modeling of local temporal dynamics. To address these issues, this paper proposes an end-to-end trainable framework with three key innovations: (1) a speaker-attribute attractor paradigm that explicitly models fine-grained acoustic attributes via multi-stage intermediate representations; (2) the integration of the Conformer architecture into the diarization backbone to strengthen modeling of local temporal dependencies in speech; and (3) a joint design of speaker-attribute embeddings with differentiable clustering, removing the need for a fixed speaker-count assumption. Evaluated on the CALLHOME dataset, the method achieves a significant DER reduction and demonstrates better robustness and generalization than both EDA and pure-Transformer baselines. The results validate the effectiveness of combining attribute-driven attractors with enhanced local modeling for speaker diarization.
📝 Abstract
In recent years, end-to-end approaches have made notable progress on speaker diarization, the task of segmenting multi-talker recordings and identifying who spoke when. One such approach, Encoder-Decoder Attractors (EDA), was proposed to handle variable numbers of speakers and to better guide the network during training. In this study, we extend the attractor paradigm by moving beyond direct speaker modeling and instead representing more detailed ‘speaker attributes’ through a multi-stage process of intermediate representations. Additionally, we enhance the architecture by replacing Transformers with Conformers, a convolution-augmented Transformer variant, to better model local dependencies. Experiments demonstrate improved diarization performance on the CALLHOME dataset.
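One part of the EDA mechanism the abstract builds on can be illustrated concretely: attractors are emitted sequentially, each with a learned existence probability, and decoding stops at the first attractor whose probability falls below a threshold, so the speaker count need not be fixed in advance. Below is a minimal sketch of that stopping rule in plain Python; the function name and the threshold value 0.5 are illustrative assumptions, not details taken from the paper.

```python
def count_speakers(existence_probs, threshold=0.5):
    """Infer the number of speakers from attractor existence probabilities.

    EDA-style decoding emits attractors one at a time; the first attractor
    whose existence probability drops below the threshold acts as a stop
    signal, and the attractors before it define the active speakers.
    (Illustrative sketch; threshold=0.5 is an assumed value.)
    """
    count = 0
    for p in existence_probs:
        if p < threshold:
            break  # stop signal: no further attractors are kept
        count += 1
    return count

# Three confident attractors followed by low-probability stop candidates:
print(count_speakers([0.97, 0.91, 0.82, 0.12, 0.03]))  # → 3
```

In the full model these probabilities come from a decoder over the frame-level embeddings, and each retained attractor is compared against every frame to produce per-speaker activity posteriors; the sketch only shows how a variable speaker count falls out of the thresholding step.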