SF-Speech: Straightened Flow for Zero-Shot Voice Clone on Small-Scale Dataset

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
In zero-shot voice cloning, flow-matching ODE models suffer from curved generation trajectories and inefficient multi-step sampling, a consequence of assuming a standard Gaussian initial distribution, which creates many intersections among the fitted flow-matching targets. To address this, the paper proposes SF-Speech, a voice-clone model that combines a neural ODE with in-context learning. Its core idea is a lightweight multi-stage module that produces a more deterministic initial distribution for the ODE; jointly training this module with the ODE model straightens the curved reverse trajectories without introducing any additional loss function. On datasets of various scales, SF-Speech outperforms state-of-the-art zero-shot TTS methods including Voicebox and E2 TTS while needing only a quarter of the solver steps, yielding roughly 3.7× faster generation, and it remains effective when trained on small-scale data.

📝 Abstract
Recently, neural ordinary differential equations (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice clone task. Nevertheless, postulating standard Gaussian noise as the initial distribution of ODE gives rise to numerous intersections within the fitted targets of flow matching, which presents challenges to model training and enhances the curvature of the learned generated trajectories. These curved trajectories restrict the capacity of ODE models for generating desirable samples with a few steps. This paper proposes SF-Speech, a novel voice clone model based on ODE and in-context learning. Unlike the previous works, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by jointly training it with the proposed module. Experiment results on datasets of various scales show that SF-Speech outperforms the state-of-the-art zero-shot TTS methods and requires only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page (https://lixuyuan102.github.io/Demo/).
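The abstract's central mechanism can be made concrete with a minimal sketch of conditional flow matching: points along a straight path between an initial sample x0 and a data sample x1 are paired with a constant target velocity. In vanilla flow matching, x0 is Gaussian noise, so different (x0, x1) pairings produce crossing paths; in the SF-Speech-style setup, x0 would instead come from a deterministic coarse estimate. The function name and shapes below are illustrative, not from the paper.

```python
import numpy as np

def fm_pair(x1, x0, t):
    """Return the interpolated point x_t and its flow-matching target velocity.

    x0: initial sample (Gaussian noise in vanilla flow matching; a more
        deterministic coarse estimate in an SF-Speech-style setup).
    x1: data sample (e.g. a target acoustic feature frame).
    t:  scalar time in [0, 1].
    """
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    xt = (1.0 - t) * x0 + t * x1      # point on the straight path at time t
    v_target = x1 - x0                # constant velocity along that path
    return xt, v_target
```

A regression loss between a network's predicted velocity and `v_target` is what flow matching trains; when the x0 samples are deterministic given the context, fewer of these straight paths intersect, so the learned field stays closer to straight lines.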
Problem

Research questions and friction points this paper is trying to address.

Standard Gaussian initial noise creates intersections among flow-matching targets, complicating training
Intersecting targets increase the curvature of the learned generation trajectories
Curved trajectories prevent ODE models from generating good samples in few solver steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage module generates deterministic initial distribution
Joint training straightens ODE reverse trajectories
Lightweight design enhances speed and performance
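Why straighter trajectories allow fewer solver steps can be seen from a fixed-step Euler integrator, the simplest ODE sampler: a perfectly straight trajectory (constant velocity field) is integrated exactly in a single step, while a curved field accumulates discretization error unless many steps are used. This is a generic illustration, not the paper's solver.

```python
import numpy as np

def euler_solve(v_fn, x0, n_steps):
    """Integrate dx/dt = v_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_fn(x, i * dt)  # one Euler step along the local velocity
    return x
```

For a straight trajectory from `start` to `target`, the field `v_fn = lambda x, t: target - start` is constant, and `euler_solve(v_fn, start, 1)` already lands exactly on `target`; adding steps changes nothing. Curved fields have no such exactness, which is why trajectory straightening translates directly into a reduced step budget.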
Xuyuan Li
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Zengqiang Shang
Institute of Acoustics, Chinese Academy of Sciences
Hua Hua
Tencent
Peiyang Shi
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Chen Yang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Li Wang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China
Pengyuan Zhang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, China