Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling

📅 2024-04-14
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
End-to-end neural speaker diarization must handle a variable number of speakers and capture local temporal structure in speech. This paper extends the Encoder-Decoder Attractor (EDA) framework in two ways: (1) a speaker-attribute attractor paradigm that, instead of modeling speakers directly, represents more detailed speaker attributes through a multi-stage process of intermediate representations; and (2) replacing Transformers with Conformers, a convolution-augmented Transformer variant, to better model local dependencies. The authors report improved diarization performance on the CALLHOME dataset.

📝 Abstract
In recent years, end-to-end approaches have made notable progress in addressing the challenge of speaker diarization, which involves segmenting and identifying speakers in multi-talker recordings. One such approach, Encoder-Decoder Attractors (EDA), has been proposed to handle variable speaker counts as well as better guide the network during training. In this study, we extend the attractor paradigm by moving beyond direct speaker modeling and instead focus on representing more detailed ‘speaker attributes’ through a multi-stage process of intermediate representations. Additionally, we enhance the architecture by replacing transformers with conformers, a convolution-augmented transformer, to model local dependencies. Experiments demonstrate improved diarization performance on the CALLHOME dataset.
Problem

Research questions and friction points this paper is trying to address.

Enhancing speaker diarization via detailed speaker attribute modeling
Replacing transformers with conformers for local dependency capture
Improving performance on multi-talker recordings like CALLHOME
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage speaker attribute attractors
Conformer-based local dependency modeling
Enhanced EDA for variable speaker counts
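
For readers unfamiliar with attractors, the sketch below illustrates how a standard EDA-style system can turn attractors into frame-level speaker activity with a variable speaker count. It is a toy illustration only, not the paper's speaker-attribute method: the function names, the 0.5 thresholds, and the hand-picked embeddings, attractors, and existence probabilities are all hypothetical stand-ins for values a trained network would produce.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def count_speakers(existence_probs, threshold=0.5):
    """Count attractors up to the first one whose existence probability
    falls below the threshold (assumed stopping rule)."""
    n = 0
    for p in existence_probs:
        if p < threshold:
            break
        n += 1
    return n


def diarize(frame_embeddings, attractors, existence_probs, threshold=0.5):
    """Return a frames x speakers matrix of 0/1 activity decisions.

    Activity for frame t and speaker s is sigmoid(dot(e_t, a_s)) > 0.5,
    so several speakers can be active in one frame (overlap) or none (silence).
    """
    n_speakers = count_speakers(existence_probs, threshold)
    active = attractors[:n_speakers]
    decisions = []
    for e in frame_embeddings:
        row = []
        for a in active:
            score = sum(x * y for x, y in zip(e, a))
            row.append(1 if sigmoid(score) > 0.5 else 0)
        decisions.append(row)
    return decisions


# Hypothetical toy inputs: 4 frames, 3 candidate attractors, 2-dim embeddings.
frames = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0], [-2.0, -2.0]]
attractors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
existence = [0.9, 0.8, 0.1]  # third attractor is judged not to exist

activity = diarize(frames, attractors, existence)
print(activity)
```

In this toy case the third attractor is discarded, so two speakers are inferred; the four frames come out as speaker 1 only, speaker 2 only, overlapped speech, and silence.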
Authors
David Palzer
The Ohio State University, Computer Science and Engineering
Matthew Maciejewski
Johns Hopkins University
speech separation, speaker diarization, speaker identification
E. Fosler-Lussier
The Ohio State University, Computer Science and Engineering