LLM-based speaker diarization correction: A generalizable approach

📅 2024-06-07
🏛️ Speech Communication
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address low speaker diarization accuracy in ASR transcripts and the poor generalizability of existing correction methods across ASR systems, this paper proposes an ASR-agnostic post-processing framework based on large language models (LLMs). The method fine-tunes an LLM to rectify speaker-label errors directly in raw ASR outputs. Because a model fine-tuned on transcripts from one ASR tool generalizes poorly to transcripts from others, the authors build an ensemble model by combining the weights of three models, each fine-tuned on transcripts from a different ASR tool. Evaluated on a held-out portion of the Fisher corpus and an independent test set, the fine-tuned models markedly improve diarization accuracy, and the ensemble outperforms each ASR-specific model, suggesting that a generalizable, ASR-agnostic approach is achievable. Key contributions include: (1) an LLM-based diarization correction paradigm that does not rely on any specific ASR system; (2) a scalable, data-driven approach that improves both accuracy and cross-ASR robustness; and (3) publicly released model weights on HuggingFace for plug-and-play deployment.

📝 Abstract
Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.
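The ensemble step described in the abstract, combining weights from three ASR-specific fine-tuned models, can be sketched as simple elementwise parameter averaging ("model souping"). The averaging scheme is an assumption for illustration; the abstract states only that weights from the three models were combined. State dicts are modeled as plain `{param_name: list_of_floats}` dicts so the sketch stays dependency-free.

```python
def average_state_dicts(state_dicts):
    """Average parameter values elementwise across fine-tuned models."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    averaged = {}
    for key in state_dicts[0]:
        # Collect this parameter from every model and average elementwise.
        values = [sd[key] for sd in state_dicts]
        averaged[key] = [sum(vs) / len(vs) for vs in zip(*values)]
    return averaged

# Hypothetical state dicts for three models, each fine-tuned on
# transcripts from a different ASR tool (tool names omitted here).
model_asr1 = {"w": [1.0, 2.0]}
model_asr2 = {"w": [3.0, 4.0]}
model_asr3 = {"w": [5.0, 6.0]}
ensemble = average_state_dicts([model_asr1, model_asr2, model_asr3])
# ensemble["w"] == [3.0, 4.0]
```

In practice this would operate on tensors in a framework such as PyTorch, but the averaging logic is the same.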
Problem

Research questions and friction points this paper is trying to address.

Improving speaker diarization accuracy using LLMs
Addressing generalizability issues in diarization correction
Developing an ASR-agnostic ensemble model for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs fine-tuned for diarization correction
Ensemble model combining multiple ASR tools
Publicly available model weights on HuggingFace
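Since the models correct speaker labels directly in raw ASR output, fine-tuning presumably pairs a speaker-tagged ASR transcript (input) with the same transcript carrying corrected labels (target). The serialization below is a hypothetical format for illustration only; the exact prompt format used by the released models is not described here.

```python
def serialize(turns):
    """Render (speaker, text) turns as one speaker-tagged transcript string."""
    return "\n".join(f"speaker{spk}: {text}" for spk, text in turns)

# Raw ASR output with a diarization error on the second turn...
asr_turns = [(1, "how are you"), (1, "fine thanks"), (1, "good to hear")]
# ...and the reference transcript with correct labels (the training target).
ref_turns = [(1, "how are you"), (2, "fine thanks"), (1, "good to hear")]

example = {"input": serialize(asr_turns), "target": serialize(ref_turns)}
```

A fine-tuned LLM trained on such pairs can then be applied to any speaker-tagged transcript at inference time, which is what makes the post-processing approach attractive as a plug-and-play correction step.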
👥 Authors
Georgios Efstathiadis
Brooklyn Health, Brooklyn, NY 11201, USA; Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
Vijay Yadav
Brooklyn Health, Brooklyn, NY 11201, USA; School of Psychology, University of New South Wales, Sydney, NSW 2052, Australia
Anzar Abbas
Brooklyn Health, Brooklyn, NY 11201, USA