🤖 AI Summary
To address low speaker diarization accuracy in ASR transcripts, and the poor generalizability of existing correction methods across ASR systems, this paper proposes an ASR-agnostic post-processing correction framework based on large language models (LLMs). An LLM is fine-tuned on the Fisher corpus to directly rectify speaker label errors in raw ASR outputs. Because a model fine-tuned on one ASR tool's transcripts transfers poorly to transcripts from other tools, the authors build an ensemble model by combining the weights of three models, each fine-tuned on transcripts from a different ASR tool, to achieve robust cross-tool generalization. Evaluated on a held-out portion of the Fisher corpus and an independent test set, the framework markedly improves diarization accuracy, and the ensemble outperforms each ASR-specific model overall. The model weights are publicly released on Hugging Face for plug-and-play deployment. Key contributions include: (1) an ASR-agnostic, LLM-based diarization correction paradigm that does not depend on any specific ASR system; (2) a scalable, data-driven approach that improves both accuracy and cross-ASR robustness; and (3) openly released weights enabling reproducible, practical adoption.
📝 Abstract
Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.
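The ensemble model is described as combining the weights of three ASR-specific fine-tuned models. A minimal sketch of that weight-averaging idea is below, using toy state dicts in place of real LLM checkpoints; the function name, the per-ASR checkpoint names, and the use of a uniform average are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of merging several fine-tuned checkpoints into one "ensemble" model
# by element-wise averaging of their parameters. Toy dicts of floats stand in
# for real transformer state_dicts (hypothetical names and values).

def merge_state_dicts(state_dicts):
    """Return the element-wise average of parameters across checkpoints."""
    n = len(state_dicts)
    return {
        key: [sum(sd[key][i] for sd in state_dicts) / n
              for i in range(len(state_dicts[0][key]))]
        for key in state_dicts[0]
    }

# Three toy "models", each fine-tuned on transcripts from a different ASR tool.
whisper_sd = {"layer.weight": [1.0, 2.0]}
aws_sd     = {"layer.weight": [3.0, 4.0]}
azure_sd   = {"layer.weight": [5.0, 6.0]}

ensemble = merge_state_dicts([whisper_sd, aws_sd, azure_sd])
print(ensemble)  # {'layer.weight': [3.0, 4.0]}
```

With real checkpoints, the same averaging would be applied key-by-key to tensors from `state_dict()`, then loaded back into a model of the same architecture; the merged weights can be used without any per-ASR model selection at inference time.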