🤖 AI Summary
This work addresses continuous, speaker-separated transcription of multi-talker meetings in which the number of speakers is unknown, speech segments overlap dynamically, and only a single-channel recording is available. The method combines a mixture encoder, extended here from a static two-speaker setup to realistic meeting conditions, with speech separators of varying strength, including the time-frequency (T-F) domain TF-GridNet architecture. The mixture encoder handles variable speaker counts and time-varying overlap, while TF-GridNet models natural overlapping speech well enough to largely close the gap to oracle separation, with or without mixture encoding. Evaluated on LibriCSS with a single microphone, the approach sets a new state of the art, and the remaining potential for improvement is analyzed.
📝 Abstract
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common approach first separates the speech into overlap-free streams and then performs ASR on each stream. Recently, TF-GridNet has shown impressive speech separation performance in real reverberant conditions. In addition, a mixture encoder has been proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further probe its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, regardless of whether mixture encoding is used. Finally, we investigate the remaining potential for improvement.