🤖 AI Summary
Diffusion and flow-matching models for speech enhancement rely on multi-step sampling, which is computationally expensive and sensitive to discretization error. To address this, the paper proposes COSE, a one-step generative framework. Its core idea is to reformulate the dynamics through an average velocity field, computed efficiently with a velocity composition identity that avoids costly Jacobian-vector products (JVPs). COSE remains theoretically consistent with continuous-time flow matching while reducing both training and inference cost. On standard benchmarks, it achieves up to 5× faster sampling and cuts training cost by 40% without compromising speech quality.
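For orientation, the average-velocity idea can be written out as follows. This is a sketch in MeanFlow-style notation, which is an assumption here: $v$ is the instantaneous velocity, $u$ the average velocity over $[r, t]$, $z_\tau$ the state along one trajectory, and $s$ an intermediate time. The summary does not state the exact form of COSE's identity; the last line is simply the additivity of the integral, shown to illustrate how a composition rule can relate average velocities without any derivative term.

```latex
\begin{align}
u(z_t, r, t) &= \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau, \\
z_r &= z_t - (t - r)\, u(z_t, r, t), \\
(t - r)\, u(z_t, r, t) &= (s - r)\, u(z_s, r, s) + (t - s)\, u(z_t, s, t),
\qquad r \le s \le t.
\end{align}
```

Under this convention, taking $r = 0$ and $t = 1$ in the second line gives generation in a single network evaluation.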
📝 Abstract
Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their reliance on multi-step generation makes them computationally expensive and vulnerable to discretization errors. Recent advances in one-step generative modeling, particularly MeanFlow, provide a promising alternative by reformulating dynamics through average velocity fields. In this work, we present COSE, a one-step FM framework tailored for SE. To address the high training overhead of Jacobian-vector product (JVP) computations in MeanFlow, we introduce a velocity composition identity that computes the average velocity efficiently, eliminating the JVP computation while preserving theoretical consistency and achieving competitive enhancement quality. Extensive experiments on standard benchmarks show that COSE delivers up to 5× faster sampling and reduces training cost by 40%, all without compromising speech quality. Code is available at https://github.com/ICDM-UESTC/COSE.
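The contrast between multi-step and one-step sampling can be checked numerically on a toy problem. The sketch below is not the COSE implementation: it assumes a hypothetical scalar ODE dz/dt = -z whose exact solution is known, so the average velocity can be computed in closed form instead of being predicted by a network. The names `flow`, `avg_velocity`, and `euler_sample` are illustrative. It checks that average velocities over adjacent intervals compose by additivity of the displacement, that a single average-velocity step lands on the exact endpoint, and that a few Euler steps on the instantaneous velocity do not.

```python
import math

def flow(z, t_from, t_to):
    """Exact solution map of the toy ODE dz/dt = -z (an assumed example)."""
    return z * math.exp(-(t_to - t_from))

def avg_velocity(z_t, r, t):
    """Average velocity over [r, t] on the trajectory that passes through z_t at time t."""
    z_r = flow(z_t, t, r)
    return (z_t - z_r) / (t - r)

def euler_sample(z_t, r, t, n_steps):
    """Multi-step baseline: integrate from t back to r with the instantaneous velocity."""
    dt = (t - r) / n_steps
    z = z_t
    for _ in range(n_steps):
        z = z - dt * (-z)  # step backward in time using v(z) = -z
    return z

z_t, r, s, t = 2.0, 0.0, 0.4, 1.0
z_s = flow(z_t, t, s)  # state at the intermediate time s

# Composition: the displacement over [r, t] is the sum over [r, s] and [s, t].
lhs = (t - r) * avg_velocity(z_t, r, t)
rhs = (s - r) * avg_velocity(z_s, r, s) + (t - s) * avg_velocity(z_t, s, t)

# One-step sampling with the average velocity versus the exact endpoint.
one_step = z_t - (t - r) * avg_velocity(z_t, r, t)
exact = flow(z_t, t, r)
euler_5 = euler_sample(z_t, r, t, n_steps=5)

print("composition gap:", abs(lhs - rhs))
print("one-step error:", abs(one_step - exact))
print("5-step Euler error:", abs(euler_5 - exact))
```

In a learned model the average velocity is only approximated by the network, so the one-step error would not be zero; the example only illustrates why an accurate average velocity removes the discretization error that a multi-step solver incurs.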