🤖 AI Summary
In meeting transcription, speech separation suffers from inter-channel leakage, particularly during segments where only one speaker is active, and the impact of voice activity detection (VAD) and segmentation strategies on end-to-end performance remains poorly understood. This paper extends a previously proposed leakage analysis framework with sensitivity to temporal locality, modeling how separation quality depends on local temporal structure; it reveals that inter-channel leakage has limited impact on end-to-end transcription accuracy because the leaked segments are largely discarded by the VAD. The authors systematically evaluate diverse segmentation strategies and, on LibriCSS, demonstrate that advanced speaker diarization reduces the gap to oracle segmentation by approximately one-third compared to energy-based VAD. Adopting a modular architecture comprising VAD, speaker clustering, separation, and recognition, they train the ASR module exclusively on LibriSpeech and achieve state-of-the-art word error rate on LibriCSS among systems under that training constraint, validating strong cross-dataset generalization.
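The temporally local leakage measurement can be illustrated with a minimal sketch: compute per-frame energy ratios between the separated cross channel and the primary channel, restricted to frames where only the primary speaker is active. This is not the paper's implementation; the function name, frame parameters, and synthetic signals below are all illustrative assumptions.

```python
import numpy as np

def cross_channel_leakage_db(primary, cross, single_speaker_mask,
                             frame=400, hop=160):
    """Per-frame cross-to-primary energy ratio in dB, evaluated only on
    frames where the mask marks the primary speaker as the sole active one."""
    n = min(len(primary), len(cross))
    n_frames = 1 + (n - frame) // hop
    ratios = []
    for i in range(n_frames):
        if not single_speaker_mask[i]:
            continue
        s = slice(i * hop, i * hop + frame)
        e_primary = np.mean(primary[s] ** 2) + 1e-12
        e_cross = np.mean(cross[s] ** 2) + 1e-12
        ratios.append(10.0 * np.log10(e_cross / e_primary))
    return np.array(ratios)

# Synthetic check: the cross channel carries the primary signal
# attenuated by a factor of 0.1, i.e. leakage at exactly -20 dB.
sr = 16000
t = np.arange(sr) / sr
primary = 0.5 * np.sin(2 * np.pi * 220 * t)
cross = 0.1 * primary
mask = np.ones(1 + (sr - 400) // 160, dtype=bool)
leakage = cross_channel_leakage_db(primary, cross, mask)
```

Restricting the statistic to single-speaker frames is what makes the analysis temporally local: leakage during overlap is expected, whereas leakage while only the primary speaker talks is the failure mode the paper quantifies.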
📝 Abstract
Meeting transcription has seen remarkable progress in recent years, yet challenges remain that limit its performance. In this work, we extend a previously proposed framework for analyzing leakage in speech separation with proper sensitivity to temporal locality. We show that there is significant leakage into the cross channel in regions where only the primary speaker is active. At the same time, the results demonstrate that this barely affects the final performance, as the leaked parts are largely ignored by the voice activity detection (VAD). Furthermore, a comparison of different segmentations shows that advanced diarization approaches reduce the gap to oracle segmentation by a third relative to a simple energy-based VAD, and we identify the factors that contribute to the remaining difference. The results represent state-of-the-art performance on LibriCSS among systems that train the recognition module on LibriSpeech data only.
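As a reference point for the simple energy-based VAD baseline mentioned above, here is a minimal sketch of frame-wise energy thresholding. The threshold, frame sizes, and synthetic signal are illustrative assumptions, not the system evaluated in the paper.

```python
import numpy as np

def energy_vad(signal, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Mark a frame as speech when its energy lies within threshold_db
    of the loudest frame. A deliberately simple energy-based baseline."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    energies = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                         for i in range(n_frames)])
    energy_db = 10.0 * np.log10(energies / (energies.max() + 1e-12) + 1e-12)
    return energy_db > threshold_db

# Synthetic stream: near-silence, an active tone, near-silence.
rng = np.random.default_rng(0)
sr = 16000
sig = np.concatenate([
    1e-3 * rng.standard_normal(sr // 2),
    0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr),
    1e-3 * rng.standard_normal(sr // 2),
])
decisions = energy_vad(sig)
```

A baseline of this kind has no notion of speaker identity, which is one reason advanced diarization (segmenting by both activity and speaker) can close part of the gap to oracle segmentation.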