🤖 AI Summary
Most automatic speech processing systems run open loop: they cannot take user feedback about who said what, so speaker diarization errors go uncorrected. This work introduces a closed-loop, human-in-the-loop speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline combines streaming ASR and diarization, LLM-generated summaries, and brief verbal feedback from users into a low-latency interactive workflow. Two techniques make the workflow more effective. First, split-when-merged (SWM) detects and splits multi-speaker segments that the ASR erroneously attributes to a single speaker. Second, online speaker enrollments are collected from users' corrections, which helps prevent similar diarization errors later in the session. In LLM-driven simulations on the AMI test set, the system reduces the diarization error rate (DER) by 9.92% and speaker confusion error by 44.23%.
📝 Abstract
Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to users, and accepts brief verbal feedback that is incorporated immediately without disrupting the interaction. Moreover, we develop two techniques to make the workflow more effective. First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to a single speaker. Second, online speaker enrollments are collected from users' diarization corrections, which helps prevent future speaker diarization errors. LLM-driven simulations on the AMI test set indicate that our system reduces the diarization error rate (DER) by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary vs. full-transcript display, limits on the number of online enrollments, and correction frequency.
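The idea of collecting online speaker enrollments from user corrections can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `OnlineEnrollment`, the `max_per_speaker` cap, the cosine-similarity matching rule, and the toy two-dimensional embeddings are all assumptions made for illustration. A real system would obtain embeddings from a speaker-embedding model.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineEnrollment:
    """Hypothetical store of speaker embeddings gathered from user corrections."""

    def __init__(self, max_per_speaker=3):
        # Assumed cap on enrollments kept per speaker; oldest entries are dropped first.
        self.max_per_speaker = max_per_speaker
        self.store = {}

    def enroll(self, speaker, embedding):
        """Record the embedding of a segment the user attributed to `speaker`."""
        bucket = self.store.setdefault(speaker, deque(maxlen=self.max_per_speaker))
        bucket.append(list(embedding))

    def identify(self, embedding, default=None):
        """Return the enrolled speaker whose best-matching embedding is most similar."""
        best_speaker, best_score = default, -math.inf
        for speaker, embeddings in self.store.items():
            score = max(cosine(embedding, e) for e in embeddings)
            if score > best_score:
                best_speaker, best_score = speaker, score
        return best_speaker


# Usage: two corrections create enrollments, and a later segment is matched against them.
registry = OnlineEnrollment(max_per_speaker=2)
registry.enroll("Alice", [1.0, 0.0])
registry.enroll("Bob", [0.0, 1.0])
later_segment_speaker = registry.identify([0.9, 0.1])
```

In this sketch, each correction adds evidence for a speaker, so later segments with similar embeddings can be attributed without further user input. The cap on stored enrollments stands in for the enrollment limit analyzed in the abstract.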