Interactive Real-Time Speaker Diarization Correction with Human Feedback

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most automatic speech processing systems run open loop: they cannot take user feedback about speaker identity, so speaker attribution errors in diarization go uncorrected. This work presents an LLM-assisted, human-in-the-loop speaker diarization correction system that lets users fix attribution errors in real time. The pipeline combines streaming ASR and diarization, LLM-generated summaries shown to users, and brief verbal feedback that is incorporated immediately. Two techniques make the workflow more effective: split-when-merged (SWM), which detects and splits multi-speaker segments that were erroneously attributed to a single speaker, and online speaker enrollment, which collects enrollments from users' corrections to help prevent future errors. In LLM-driven simulations on the AMI test set, the system reduces the diarization error rate (DER) by 9.92% and speaker confusion error by 44.23%.

📝 Abstract
Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to the users, and accepts brief verbal feedback that is immediately incorporated without disrupting interactions. Moreover, we develop techniques to make the workflow more effective: First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to just a single speaker. Second, online speaker enrollments are collected based on users' diarization corrections, thus helping to prevent speaker diarization errors from occurring in the future. LLM-driven simulations on the AMI test set indicate that our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary vs. full transcript display, limits on the number of online enrollments, and correction frequency.
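For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false alarm, and speaker confusion time, divided by total reference speech time. The sketch below uses that standard definition; the durations are hypothetical and are not taken from the paper. It only illustrates how a drop in speaker confusion feeds into overall DER.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations (seconds), chosen only for illustration.
before = diarization_error_rate(missed=30.0, false_alarm=20.0,
                                confusion=50.0, total_speech=1000.0)
# Same recording after corrections halve the speaker confusion time.
after = diarization_error_rate(missed=30.0, false_alarm=20.0,
                               confusion=25.0, total_speech=1000.0)

# Relative DER reduction achieved by the corrections.
relative_reduction = (before - after) / before
```

Because user corrections of speaker labels act only on the confusion term, a large relative drop in confusion can translate into a smaller relative drop in DER when missed speech and false alarms are unchanged.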
Problem

Research questions and friction points this paper is trying to address.

Correcting speaker attribution errors in real-time diarization systems
Enabling human feedback integration without disrupting conversation flow
Reducing speaker diarization errors through interactive correction techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted real-time speaker diarization correction system
Split-when-merged technique for detecting multi-speaker segments
Online speaker enrollments based on user corrections
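
The online enrollment idea can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class name `OnlineEnrollment`, the use of fixed-length speaker embeddings, centroid averaging, cosine scoring, and the `max_per_speaker` cap are hypothetical and are not described on this page as the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class OnlineEnrollment:
    """Hypothetical store of speaker embeddings gathered from user corrections."""

    def __init__(self, max_per_speaker=3):
        self.max_per_speaker = max_per_speaker  # assumed cap on enrollments
        self.enrolled = {}  # speaker name -> list of embeddings

    def add_correction(self, speaker, embedding):
        """Enroll the embedding of a segment the user attributed to `speaker`."""
        samples = self.enrolled.setdefault(speaker, [])
        samples.append(list(embedding))
        if len(samples) > self.max_per_speaker:
            samples.pop(0)  # drop the oldest enrollment

    def identify(self, embedding):
        """Return the enrolled speaker whose centroid is most similar, or None."""
        best_name, best_score = None, float("-inf")
        for name, samples in self.enrolled.items():
            centroid = [sum(col) / len(samples) for col in zip(*samples)]
            score = cosine(centroid, embedding)
            if score > best_score:
                best_name, best_score = name, score
        return best_name


store = OnlineEnrollment(max_per_speaker=2)
store.add_correction("Alice", [1.0, 0.0])  # toy 2-D embeddings
store.add_correction("Bob", [0.0, 1.0])
guess = store.identify([0.9, 0.1])  # later segment, scored against enrollments
```

Each correction adds one enrollment for the named speaker, so later segments can be matched against speakers the user has already identified.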
Xinlu He
Worcester Polytechnic Institute
machine learning, deep learning, multi-modality
Yiwen Guan
Worcester Polytechnic Institute, USA
Badrivishal Paurana
Worcester Polytechnic Institute, USA
Zilin Dai
Worcester Polytechnic Institute, USA
Jacob Whitehill
Worcester Polytechnic Institute
Artificial Intelligence