🤖 AI Summary
Diffusion-based speech enhancement is slow at inference time, particularly with Schrödinger bridge (SB) methods, which need many function evaluations to reach high quality. Consistency models speed up sampling, but their quality does not improve as the number of steps grows. To address both issues, we propose the Schrödinger Bridge Consistency Trajectory Model (SBCTM), which applies consistency trajectory model (CTM) training to the SB framework for speech enhancement and adds an auxiliary loss that includes a perceptual term. Using distillation and multi-step refinement, SBCTM supports flexible 1-step or multi-step inference, so extra steps can be reserved for inputs where a single step is not enough. Experiments show an approximately 16× improvement in real-time factor (RTF) over the conventional SB for speech enhancement, together with a favorable trade-off between perceptual quality and inference speed.
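For readers unfamiliar with the metric, the real-time factor is conventionally the processing time divided by the duration of the audio processed, so values below 1.0 mean faster than real time. The sketch below is a minimal illustration of that definition. The function names, the `enhance_fn` callable, and the timings are hypothetical and do not come from the paper or its repository.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(enhance_fn, waveform, sample_rate: int) -> float:
    """Time a single call to a (hypothetical) enhancer and return its RTF."""
    start = time.perf_counter()
    enhance_fn(waveform)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)


# Illustrative numbers only, chosen to show what a 16x RTF improvement means.
baseline_rtf = real_time_factor(8.0, 10.0)  # 8 s to enhance a 10 s clip
fast_rtf = real_time_factor(0.5, 10.0)      # 0.5 s to enhance the same clip
speedup = baseline_rtf / fast_rtf
```

With these made-up timings the ratio of the two RTFs is 16, which is how a "16× improvement" should be read: the same clip is processed in one sixteenth of the time.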
📝 Abstract
Speech enhancement (SE) using diffusion models is a promising technology for improving speech quality in noisy speech data. Recently, the Schrödinger bridge (SB) has been used in diffusion-based SE to improve speech quality by resolving the mismatch between the endpoint of the forward process and the starting point of the reverse process. However, the SB still suffers from slow inference because it requires a large number of function evaluations (NFE) to obtain high-quality results. In the field of image generation, Consistency Models (CMs) address this issue through consistency training that uses distillation from pretrained models, but they do not improve generation quality as the number of steps increases. As a solution to this problem, Consistency Trajectory Models (CTMs) not only accelerate inference but also maintain a favorable trade-off between quality and speed. Furthermore, SoundCTM demonstrates the applicability of CTM techniques to the field of sound generation. In this paper, we present Schrödinger bridge Consistency Trajectory Models (SBCTM) by applying CTM techniques to the Schrödinger bridge for SE. We also introduce a novel auxiliary loss, which includes a perceptual loss, into the original CTM training framework. As a result, SBCTM achieves an approximately 16x improvement in the real-time factor (RTF) compared to the conventional Schrödinger bridge for SE. Moreover, the favorable trade-off between quality and speed in SBCTM allows for time-efficient inference by limiting multi-step refinement to cases where 1-step inference is insufficient. Our code, pretrained models, and audio samples are available at https://github.com/sony/sbctm/.
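The idea of limiting multi-step refinement to cases where 1-step inference is insufficient can be sketched as a simple control loop. This is only an illustration under stated assumptions, not the SBCTM implementation: `one_step`, `refine`, `quality`, `threshold`, and `max_extra_steps` are hypothetical names, and the abstract does not say how insufficiency is detected, so a generic quality score is assumed here.

```python
from typing import Any, Callable, Tuple


def enhance_adaptive(
    noisy: Any,
    one_step: Callable[[Any], Any],     # hypothetical 1-step enhancer
    refine: Callable[[Any, Any], Any],  # hypothetical refinement step
    quality: Callable[[Any], float],    # hypothetical score, higher is better
    threshold: float,
    max_extra_steps: int = 3,
) -> Tuple[Any, int]:
    """Run 1-step inference, then refine only while quality is below threshold."""
    estimate = one_step(noisy)
    steps = 1
    while quality(estimate) < threshold and steps < 1 + max_extra_steps:
        estimate = refine(noisy, estimate)
        steps += 1
    return estimate, steps


# Toy stand-ins on scalars: each step halves the residual "noise" value.
def toy_one_step(x: float) -> float:
    return 0.5 * x


def toy_refine(noisy: float, estimate: float) -> float:
    return 0.5 * estimate


def toy_quality(estimate: float) -> float:
    return -abs(estimate)


easy = enhance_adaptive(0.25, toy_one_step, toy_refine, toy_quality, -0.2)
hard = enhance_adaptive(1.0, toy_one_step, toy_refine, toy_quality, -0.2)
```

In this toy run the easy input stops after one step, while the hard input takes two additional refinement steps, so the average cost stays close to 1-step inference when most inputs are easy.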