🤖 AI Summary
This work addresses the difficulty of reproducing and extending DiariZen, an open-source state-of-the-art speaker diarization system whose code spans several repositories and frameworks. It presents a self-contained, visualizable, and code-aligned decomposition of the DiariZen pipeline into seven stages: audio loading and sliding-window segmentation, feature extraction with a structurally pruned WavLM-Large encoder and learned layer weighting, a Conformer backend with powerset classification, segmentation aggregation via overlap-add, speaker embedding extraction with overlap exclusion, VBx clustering with PLDA scoring, and reconstruction with RTTM output. Standalone executable scripts, a Jupyter notebook, and annotated visualizations accompany each stage, lowering the barrier to entry for researchers and supporting reproducible experimentation and teaching. The underlying DiariZen system is the leading open-source state of the art across multiple benchmarks at the time of writing.
📝 Abstract
Speaker diarization (SD) is the task of answering "who spoke when" in a multi-speaker audio stream. Classically, an SD system groups speech segments by speaker identity through clustering. Recent years have seen substantial progress in SD through end-to-end neural diarization (EEND) approaches. DiariZen, a hybrid SD pipeline built upon a structurally pruned WavLM-Large encoder, a Conformer backend with powerset classification, and VBx clustering, represents the leading open-source state of the art at the time of writing across multiple benchmarks. Despite its strong performance, the DiariZen architecture spans several repositories and frameworks, making it difficult for researchers and practitioners to understand, reproduce, or extend the system as a whole. This tutorial paper provides a self-contained, block-by-block explanation of the complete DiariZen pipeline, decomposing it into seven stages: (1) audio loading and sliding-window segmentation, (2) WavLM feature extraction with learned layer weighting, (3) Conformer backend and powerset classification, (4) segmentation aggregation via overlap-add, (5) speaker embedding extraction with overlap exclusion, (6) VBx clustering with PLDA scoring, and (7) reconstruction and RTTM output. For each block, we provide the conceptual motivation, source code references, intermediate tensor shapes, and annotated visualizations of the actual outputs on a 30-second excerpt from the AMI Meeting Corpus. The implementation is available at https://github.com/nikhilraghav29/diarizen-tutorial, which includes standalone executable scripts for each block and a Jupyter notebook that runs the complete pipeline end-to-end.
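To make the powerset classification step in stage (3) more concrete, the following is a minimal sketch of the general idea, not code taken from DiariZen or the tutorial repository. In a powerset formulation, each frame is assigned to exactly one class, and each class stands for a set of simultaneously active speakers. The values used here (4 local speakers, at most 2 overlapping), the class ordering, and the function names `build_powerset_classes` and `powerset_to_multilabel` are all illustrative assumptions.

```python
from itertools import combinations


def build_powerset_classes(num_speakers, max_overlap):
    """Enumerate speaker subsets of size 0..max_overlap, smallest first.

    Hypothetical helper: the ordering is an assumption and may differ
    from the ordering used by any particular toolkit.
    """
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def powerset_to_multilabel(class_ids, classes, num_speakers):
    """Convert per-frame class indices to per-frame binary speaker activity."""
    frames = []
    for class_id in class_ids:
        active = classes[class_id]
        frames.append([1 if spk in active else 0 for spk in range(num_speakers)])
    return frames


# Assumed setting: 4 local speakers, at most 2 speaking at once.
# This gives 1 (silence) + 4 (single speaker) + 6 (speaker pairs) = 11 classes.
classes = build_powerset_classes(num_speakers=4, max_overlap=2)

# Hypothetical per-frame decisions (e.g. the argmax over class scores).
frame_class_ids = [0, 1, 5, 2]
activity = powerset_to_multilabel(frame_class_ids, classes, num_speakers=4)

for class_id, row in zip(frame_class_ids, activity):
    print(class_id, classes[class_id], row)
```

With these assumptions, class 0 is silence, classes 1 to 4 are single speakers, and classes 5 to 10 are speaker pairs, so overlapped speech is handled by a single-label decision per frame instead of independent per-speaker thresholds.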