🤖 AI Summary
This work addresses a longstanding challenge in speech enhancement and separation: achieving high perceptual quality and task fidelity at the same time. The authors propose SIPS, a plug-and-play framework that uses stochastic interpolant dynamics to combine pretrained predictive models (such as SEMamba or FlexIO) with a generative prior trained exclusively on clean speech. During sampling, the predictor supplies a task-specific deterministic drift, while the score-based prior contributes a stochastic denoising component. Because the prior never sees degraded data, it is degradation-agnostic, and the framework is not tied to a particular predictor architecture, so the same prior can be reused across predictors and additive degradation tasks. Experiments show that SIPS improves perceptual naturalness, with gains of up to +1.0 in NISQA score on speech separation.
📝 Abstract
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.
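The split into a predictor-driven drift and a score-driven denoising term can be illustrated with a toy sampler. The sketch below is not the authors' implementation: it assumes a linear path from the degraded input `y` to the predictor estimate, an Euler-Maruyama discretization of an SDE of the form dx = [b + eps * s(x, t)] dt + sqrt(2 * eps) dW, and an analytic Gaussian score in place of a trained score model. The names `toy_predictor`, `toy_score`, `sample`, `n_steps`, and `eps` are all hypothetical.

```python
import numpy as np


def toy_predictor(y):
    """Hypothetical stand-in for a pretrained predictor (not SEMamba or FlexIO)."""
    return 0.8 * y


def toy_score(x, t, prior_std=1.0):
    """Hypothetical stand-in for a score model trained on clean speech.

    Returns the score of a zero-mean Gaussian N(0, prior_std^2 I); a real
    prior would be a neural network and would depend on t.
    """
    return -x / prior_std**2


def sample(y, predictor, score, n_steps=50, eps=0.05, rng=None):
    """Euler-Maruyama sampler with a predictor drift plus a score term.

    Assumed dynamics (illustrative only):
        dx = [(x_hat - y) + eps * score(x, t)] dt + sqrt(2 * eps) dW
    where x_hat = predictor(y) and t runs from 0 to 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    x_hat = predictor(y)      # the predictor is evaluated once, up front
    drift = x_hat - y         # constant velocity of the linear path y -> x_hat
    dt = 1.0 / n_steps
    x = y.copy()
    for k in range(n_steps):
        t = k * dt
        noise = rng.standard_normal(x.shape)
        # Deterministic task drift + score-based denoising + injected noise.
        x = x + (drift + eps * score(x, t)) * dt + np.sqrt(2.0 * eps * dt) * noise
    return x


if __name__ == "__main__":
    y = np.linspace(-1.0, 1.0, 16)  # placeholder for a degraded signal
    x = sample(y, toy_predictor, toy_score, rng=np.random.default_rng(0))
    print(x.shape)
```

With `eps=0` the score and noise terms vanish and the sampler simply transports `y` to the predictor output, which makes the role of each component easy to check in isolation; increasing `eps` gives the prior more influence over the final sample.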