SF-Speech: Straightened Flow for Zero-Shot Voice Clone on Small-Scale Dataset

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
In zero-shot voice cloning, flow-matching ODE models suffer from curved generation trajectories and inefficient multi-step sampling, a consequence of assuming a standard Gaussian initial distribution, which creates many intersections among the fitted flow-matching targets. To address this, the paper proposes SF-Speech, a voice-clone model that combines a neural ODE with in-context learning. Its core idea is a lightweight multi-stage module that produces a more deterministic initial distribution for the ODE; jointly training this module with the ODE model straightens the curved reverse trajectories without introducing any additional loss function. On datasets of various scales, SF-Speech outperforms state-of-the-art zero-shot TTS methods including Voicebox and E2 TTS while needing only a quarter of the solver steps, yielding roughly 3.7× faster generation, and it remains effective when trained on small-scale data.

📝 Abstract
Recently, neural ordinary differential equations (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice clone task. Nevertheless, postulating standard Gaussian noise as the initial distribution of ODE gives rise to numerous intersections within the fitted targets of flow matching, which presents challenges to model training and enhances the curvature of the learned generated trajectories. These curved trajectories restrict the capacity of ODE models for generating desirable samples with a few steps. This paper proposes SF-Speech, a novel voice clone model based on ODE and in-context learning. Unlike the previous works, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by jointly training it with the proposed module. Experiment results on datasets of various scales show that SF-Speech outperforms the state-of-the-art zero-shot TTS methods and requires only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page (https://lixuyuan102.github.io/Demo/).
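The abstract's central mechanism can be made concrete with a minimal sketch of conditional flow matching: points along a straight path between an initial sample x0 and a data sample x1 are paired with a constant target velocity. In vanilla flow matching, x0 is Gaussian noise, so different (x0, x1) pairings produce crossing paths; in the SF-Speech-style setup, x0 would instead come from a deterministic coarse estimate. The function name and shapes below are illustrative, not from the paper.

```python
import numpy as np

def fm_pair(x1, x0, t):
    """Return the interpolated point x_t and its flow-matching target velocity.

    x0: initial sample (Gaussian noise in vanilla flow matching; a more
        deterministic coarse estimate in an SF-Speech-style setup).
    x1: data sample (e.g. a target acoustic feature frame).
    t:  scalar time in [0, 1].
    """
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    xt = (1.0 - t) * x0 + t * x1      # point on the straight path at time t
    v_target = x1 - x0                # constant velocity along that path
    return xt, v_target
```

A regression loss between a network's predicted velocity and `v_target` is what flow matching trains; when the x0 samples are deterministic given the context, fewer of these straight paths intersect, so the learned field stays closer to straight lines.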
Problem

Research questions and friction points this paper is trying to address.

Standard Gaussian initial noise creates intersections among flow-matching targets, complicating training
Intersecting targets increase the curvature of the learned generation trajectories
Curved trajectories prevent ODE models from generating good samples in few solver steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage module generates deterministic initial distribution
Joint training straightens ODE reverse trajectories
Lightweight design enhances speed and performance
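Why straighter trajectories allow fewer solver steps can be seen from a fixed-step Euler integrator, the simplest ODE sampler: a perfectly straight trajectory (constant velocity field) is integrated exactly in a single step, while a curved field accumulates discretization error unless many steps are used. This is a generic illustration, not the paper's solver.

```python
import numpy as np

def euler_solve(v_fn, x0, n_steps):
    """Integrate dx/dt = v_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_fn(x, i * dt)  # one Euler step along the local velocity
    return x
```

For a straight trajectory from `start` to `target`, the field `v_fn = lambda x, t: target - start` is constant, and `euler_solve(v_fn, start, 1)` already lands exactly on `target`; adding steps changes nothing. Curved fields have no such exactness, which is why trajectory straightening translates directly into a reduced step budget.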
Xuyuan Li
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Zengqiang Shang
Institute of Acoustics, Chinese Academy of Sciences
Hua Hua
Tencent
Peiyang Shi
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Chen Yang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Li Wang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Pengyuan Zhang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China