🤖 AI Summary
Real-world recordings are often degraded by overlapping speech from multiple speakers and complex non-additive distortions, which pose significant challenges for conventional speech enhancement and separation methods. To address this, the work proposes Geneses, a framework that, for the first time, integrates latent flow matching with a multimodal diffusion Transformer to perform end-to-end joint speech enhancement and separation on self-supervised representations. Evaluated on two-speaker mixtures from LibriTTS-R, Geneses substantially outperforms a conventional mask-based baseline across multiple objective metrics and remains robust under complex degradation conditions.
📝 Abstract
Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that cascade speech enhancement (SE) and speech separation (SS) to obtain a clean speech signal for each speaker are promising, conventional SE-SS methods struggle with complex degradations beyond additive noise. To this end, we propose **Geneses**, a generative framework that achieves unified, high-quality SE-SS. Geneses leverages latent flow matching to estimate each speaker's clean speech features, using a multimodal diffusion Transformer conditioned on self-supervised learning (SSL) representations of the noisy mixture. We conduct experimental evaluations using two-speaker mixtures from LibriTTS-R under two conditions: additive noise only and complex degradations. The results demonstrate that Geneses significantly outperforms a conventional mask-based SE-SS method across various objective metrics, with high robustness against complex degradations. Audio samples are available on our demo page.
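
As a rough illustration of the training objective the abstract describes, here is a minimal PyTorch sketch of conditional flow matching on speech latents. A small network stands in for the multimodal diffusion Transformer, and random tensors stand in for the clean-speech latents and the SSL features of the noisy mixture; all names, dimensions, and the linear (rectified-flow) probability path are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VectorFieldNet(nn.Module):
    """Toy stand-in for the multimodal diffusion Transformer: predicts the
    flow-matching velocity from the interpolated latent, the time step,
    and the SSL conditioning features (hypothetical architecture)."""
    def __init__(self, dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the scalar time step over the frame axis.
        t_feat = t[:, None, None].expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching with a linear path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                      # Gaussian prior sample
    t = torch.rand(x1.shape[0], device=x1.device)  # one time step per item
    tb = t[:, None, None]
    x_t = (1 - tb) * x0 + tb * x1
    v_pred = model(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()

# Usage sketch (shapes assumed): x1 are one speaker's clean speech
# latents; cond are SSL features of the two-speaker noisy mixture.
model = VectorFieldNet(dim=64, cond_dim=768)
x1 = torch.randn(8, 100, 64)     # (batch, frames, latent dim)
cond = torch.randn(8, 100, 768)  # e.g., WavLM-sized features
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```

At inference, one would integrate the learned vector field from Gaussian noise toward a clean latent (e.g., with a few Euler steps), conditioning on the mixture's SSL features throughout, and run this once per target speaker.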