Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

📅 2025-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Flow-matching (FM)-based text-to-speech (TTS) models that follow a coarse-to-fine generation paradigm face a trade-off between speech naturalness and inference efficiency. This paper proposes Shallow Flow Matching (SFM) to address it. SFM constructs an intermediate state on the FM trajectory, conditioned on the coarse-grained output, and starts inference from that state so that computation is spent only on fine-grained modeling in the latter segment of the path. During training, an orthogonal projection adaptively determines the temporal position of the intermediate state, and the state is constructed with a single-segment piecewise flow formulation. A lightweight SFM head integrates the mechanism into several mainstream TTS architectures. Experiments show that SFM maintains or improves speech naturalness, with gains in objective metrics and statistically significant MOS improvements, while substantially accelerating inference when adaptive-step ODE solvers are used. Code, pretrained models, and an online demo are publicly released.

📝 Abstract

We propose a shallow flow matching (SFM) mechanism to enhance flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. SFM constructs intermediate states along the FM paths using coarse output representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. SFM inference starts from the intermediate state rather than pure noise and focuses computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments show that SFM consistently improves the naturalness of synthesized speech in both objective and subjective evaluations, while significantly reducing inference time when using adaptive-step ODE solvers. Demo and code are available at https://ydqmkkx.github.io/SFMDemo/.
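The idea of starting inference from an intermediate state can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a straight-line FM path, replaces the trained network with a hand-written velocity field, builds the intermediate state by simple interpolation at a hypothetical `t_mid = 0.7`, and uses fixed-step Euler integration rather than the adaptive-step solvers mentioned in the abstract. All names (`integrate_flow`, `toy_velocity`, `t_mid`) are illustrative.

```python
import numpy as np

def integrate_flow(velocity, x_start, t_start, t_end=1.0, step=0.1):
    """Fixed-step Euler integration of dx/dt = velocity(x, t).

    Returns the final state and the number of velocity evaluations.
    """
    n_steps = int(round((t_end - t_start) / step))
    dt = (t_end - t_start) / n_steps
    x = np.asarray(x_start, dtype=float).copy()
    for k in range(n_steps):
        t = t_start + k * dt  # always strictly below t_end
        x = x + dt * velocity(x, t)
    return x, n_steps

# Toy target standing in for a mel-spectrogram frame (assumption).
x1 = np.array([1.0, -2.0, 0.5])

def toy_velocity(x, t):
    # Stand-in for a trained FM network: on a straight path that ends
    # at x1, the remaining displacement divided by the remaining time.
    return (x1 - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)
coarse = x1 + 0.1 * rng.standard_normal(3)  # imperfect coarse output

# Standard FM inference: integrate the whole path from pure noise.
x_full, evals_full = integrate_flow(toy_velocity, noise, t_start=0.0)

# Shallow inference: build an intermediate state from the coarse output
# and integrate only the last part of the path.
t_mid = 0.7  # hypothetical position of the intermediate state
x_mid = (1.0 - t_mid) * noise + t_mid * coarse
x_shallow, evals_shallow = integrate_flow(toy_velocity, x_mid, t_start=t_mid)

print(evals_full, evals_shallow)
```

At the same step size, the shallow run needs fewer velocity evaluations because it covers only the interval from `t_mid` to 1. In a real system the saving depends on the solver and on how close the intermediate state lies to the learned path.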

Problem

Research questions and friction points this paper is trying to address.

Enhance flow matching for coarse-to-fine text-to-speech synthesis

Improve synthesized speech naturalness with adaptive intermediate states

Reduce inference time by starting generation from an intermediate state instead of pure noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shallow flow matching enhances TTS synthesis

Orthogonal projection optimizes temporal state positioning

Lightweight SFM head integrates SFM into existing TTS models and reduces inference computation
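One way to picture the orthogonal projection step is as a least-squares fit of the coarse output onto the target. The sketch below is an assumption made for illustration, not the formula from the paper: it treats the temporal position as the scalar `t` that makes the residual `coarse - t * target` orthogonal to `target`, then clips it to [0, 1]. The function name `project_time` and the `eps` guard are hypothetical.

```python
import numpy as np

def project_time(coarse, target, eps=1e-8):
    """Scalar t that minimizes ||coarse - t * target||, clipped to [0, 1].

    This is the coefficient of the orthogonal projection of `coarse`
    onto the direction of `target` (illustrative assumption).
    """
    c = np.ravel(np.asarray(coarse, dtype=float))
    x = np.ravel(np.asarray(target, dtype=float))
    t = float(np.dot(c, x) / (np.dot(x, x) + eps))
    return min(max(t, 0.0), 1.0)

target = np.array([1.0, 2.0, 2.0])
residual = 0.1 * np.array([2.0, -1.0, 0.0])  # orthogonal to target
coarse = 0.8 * target + residual

t_pos = project_time(coarse, target)
print(round(t_pos, 4))
```

Under this reading, a coarse output that already resembles the target projects to a `t` close to 1, so little of the path remains to be integrated.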
