🤖 AI Summary
This paper proposes Shallow Flow Matching (SFM) to address the trade-off between speech naturalness and inference efficiency in flow-matching (FM)-based text-to-speech (TTS) models that follow a coarse-to-fine generation paradigm. SFM constructs an intermediate state on the FM path from the coarse-grained output and starts inference from that state, so that computation is spent only on the latter, fine-grained segment of the path. During training, an orthogonal projection adaptively determines the temporal position of the intermediate state, and the state is constructed according to a single-segment piecewise flow. SFM is integrated into multiple TTS models through a lightweight SFM head. Experiments show that SFM improves speech naturalness in both objective and subjective evaluations while substantially reducing inference time when adaptive-step ODE solvers are used. A demo and code are publicly available.
📝 Abstract
We propose a shallow flow matching (SFM) mechanism to enhance flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. SFM constructs intermediate states along the FM paths using coarse output representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. SFM inference starts from the intermediate state rather than pure noise and focuses computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments show that SFM consistently improves the naturalness of synthesized speech in both objective and subjective evaluations, while significantly reducing inference time when using adaptive-step ODE solvers. A demo and code are available at https://ydqmkkx.github.io/SFMDemo/.
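The two ideas in the abstract, locating a coarse output on the FM path by orthogonal projection and then integrating only the remaining part of the path, can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration and is not taken from the released code: the function names `projection_time` and `integrate_from` are hypothetical, the path is assumed to run along the direction of the target so that the temporal position is the projection coefficient, the velocity field is an oracle that uses the known target in place of a learned network, and a fixed-step Euler loop stands in for the adaptive-step ODE solver mentioned above.

```python
import numpy as np


def projection_time(coarse, target, eps=1e-8):
    """Hypothetical helper: coefficient t such that t * target is the
    orthogonal projection of `coarse` onto `target`, clipped to [0, 1]."""
    c = coarse.ravel()
    x1 = target.ravel()
    t = float(c @ x1) / (float(x1 @ x1) + eps)
    return float(np.clip(t, 0.0, 1.0))


def integrate_from(x_start, t_start, velocity, n_steps=8):
    """Fixed-step Euler integration of dx/dt = velocity(x, t) from t_start to 1.
    (A stand-in for an adaptive-step solver, used here for self-containment.)"""
    x = x_start.copy()
    dt = (1.0 - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


# Toy data: a "mel-spectrogram" target and a coarse approximation of it.
rng = np.random.default_rng(0)
target = rng.normal(size=(80, 50))
coarse = 0.6 * target + 0.05 * rng.normal(size=target.shape)

# Temporal position of the intermediate state on the path.
t_g = projection_time(coarse, target)


def oracle_velocity(x, t):
    # Assumption: straight-line flow toward the known target; a real
    # system would use a trained network conditioned on text instead.
    return (target - x) / (1.0 - t)


# Start from the intermediate state instead of pure noise at t = 0.
out = integrate_from(coarse, t_g, oracle_velocity)
```

Because integration covers only the interval from `t_g` to 1 rather than the full interval from 0 to 1, a solver that adapts its step count to the interval length has less work to do, which is the intuition behind the reported reduction in inference time.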