Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription

📅 2023-09-15

🏛️ Spoken Language Technology Workshop

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses continuous speech separation for meeting transcription, where the number of speakers is arbitrary, the degree of overlap varies, and only a single-channel recording is available. It extends the mixture encoder, previously limited to a static two-speaker scenario, to natural meeting conditions and combines it with separators of varying strength, including TF-GridNet. The mixture encoder uses the mixed speech to mitigate the effect of separation artifacts, while TF-GridNet largely closes the gap between previous separation methods and oracle separation, independent of mixture encoding. On LibriCSS with a single microphone, the resulting system sets a new state of the art, and the work also examines the remaining potential for improvement.

📝 Abstract

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extended the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by the integration with separators of varying strength including TF-GridNet. Our experiments result in a new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation independent of mixture encoding. We further investigate the remaining potential for improvement.
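The abstract describes recognition on separated streams that is additionally informed by the unseparated mixture. As a rough illustration only, the toy sketch below pairs the features of each separated stream with the mixture features at the same time frame. Everything in it is an assumption made for illustration: the function names, the plain-list feature representation, and the choice of frame-wise concatenation are hypothetical and are not taken from the paper, whose actual encoders are neural networks.

```python
from typing import List

Frame = List[float]      # one feature vector per time frame
Features = List[Frame]   # a sequence of frames


def combine_with_mixture(separated: Features, mixture: Features) -> Features:
    """Concatenate each separated-stream frame with the mixture frame
    at the same time index (hypothetical combination rule)."""
    if len(separated) != len(mixture):
        raise ValueError("streams must have the same number of frames")
    return [s + m for s, m in zip(separated, mixture)]


def build_asr_inputs(separated_streams: List[Features],
                     mixture: Features) -> List[Features]:
    """Build one combined feature sequence per separator output stream."""
    return [combine_with_mixture(stream, mixture) for stream in separated_streams]


# Toy data: 3 frames, 2-dimensional features, two separator output streams.
mixture = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
stream_a = [[0.5, 0.0], [0.9, 0.0], [0.0, 0.0]]
stream_b = [[0.0, 0.5], [0.0, 0.1], [0.2, 0.8]]

asr_inputs = build_asr_inputs([stream_a, stream_b], mixture)
```

Each entry of `asr_inputs` would then be passed to a recognizer for that stream. The point of the sketch is only that the recognizer sees the mixture alongside the separated signal, so information lost or distorted by the separator is still available to it.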

Problem

Research questions and friction points this paper is trying to address.

Mixture encoder was previously limited to a static two-speaker scenario

Separation artifacts degrade ASR on overlapped meeting speech

Gap remains between single-microphone separation and oracle separation on LibriCSS

Innovation

Methods, ideas, or system contributions that make the work stand out.

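Extended mixture encoder to an arbitrary number of speakers and varying degrees of overlap

Integrated mixture encoder with separators of varying strength, including TF-GridNet

Achieved new state-of-the-art performance on LibriCSS using a single microphone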
