Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two bottlenecks that limit metric-induced discrete flow matching (MI-DFM) in zero-shot text-to-speech synthesis: heuristic schedulers that require hyperparameter search, and path-tracking error that arises when the first-order continuous-time Markov chain (CTMC) solver runs for a limited number of steps. The authors derive a training-free, kinetic-optimal scheduler that traverses a scalar-parameterized probability path at constant Fisher–Rao speed, and pair it with a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. The resulting method, GibbsTTS, is evaluated on codec-based speech representations. Under controlled comparisons with a unified architecture and a large-scale dataset, it achieves the best objective naturalness and is preferred over masked discrete generative baselines in subjective evaluations. Compared with the evaluated state-of-the-art TTS systems, it attains the highest speaker similarity on three of four test sets and ranks second on the fourth.
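To make the idea of traversing a scalar-parameterized path at constant Fisher–Rao speed concrete, the following is a minimal numerical sketch and not the authors' implementation. It assumes a Gibbs-style path p_β(x) ∝ exp(−β·d(x)) over a small vocabulary, where `dist` holds hypothetical distances d(x) to a target token; for this family the Fisher information with respect to β is the variance of d under p_β, so the Fisher–Rao speed is its square root. All function names and parameters here are illustrative.

```python
import numpy as np

def gibbs_path(beta, dist):
    """Assumed path family: p_beta(x) proportional to exp(-beta * dist[x])."""
    logits = -beta * dist
    logits = logits - logits.max()  # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()

def fisher_rao_speed(beta, dist):
    """sqrt(Fisher information) w.r.t. beta, i.e. the std of dist under p_beta."""
    p = gibbs_path(beta, dist)
    mean = np.sum(p * dist)
    return np.sqrt(np.sum(p * (dist - mean) ** 2))

def constant_speed_schedule(dist, beta_min, beta_max, num_steps, grid_size=2001):
    """Return num_steps + 1 beta values spaced equally in Fisher-Rao arc length."""
    betas = np.linspace(beta_min, beta_max, grid_size)
    speed = np.array([fisher_rao_speed(b, dist) for b in betas])
    # Cumulative arc length via the trapezoid rule on the dense grid.
    segments = 0.5 * (speed[1:] + speed[:-1]) * np.diff(betas)
    arclen = np.concatenate([[0.0], np.cumsum(segments)])
    # Invert the arc-length map at equally spaced targets.
    targets = np.linspace(0.0, arclen[-1], num_steps + 1)
    return np.interp(targets, arclen, betas)

# Hypothetical distances from five tokens to the target token (distance 0).
dist = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
schedule = constant_speed_schedule(dist, beta_min=0.0, beta_max=10.0, num_steps=8)
print(schedule)
```

Because the schedule is obtained by numerically inverting an arc-length integral, it needs no training and no tuned hyperparameters beyond the endpoints and the step count, which is the property the summary attributes to the proposed scheduler.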

📝 Abstract

Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
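The distinction between the jump probability and the jump destination distribution can be illustrated with a generic first-order CTMC step. The sketch below is an assumption-laden illustration, not the paper's moment correction: it only shows that a first-order step factors into "whether to jump" and "where to jump", so that the first factor can be replaced (here through the hypothetical `jump_prob` argument) without touching the second. The function name, arguments, and rate values are all illustrative.

```python
import numpy as np

def first_order_ctmc_step(state, rates, h, rng, jump_prob=None):
    """One first-order step for a single token.

    rates[j] is the (assumed) nonnegative rate of jumping from `state` to j;
    the entry at `state` itself is ignored. h is the step size.
    """
    out = np.array(rates, dtype=float)
    out[state] = 0.0
    total = out.sum()
    if total <= 0.0:
        return state  # no outgoing rate, so the token stays put
    dest = out / total  # destination distribution, conditional on a jump
    if jump_prob is None:
        jump_prob = min(1.0, h * total)  # plain first-order jump probability
    # Whether to jump depends only on jump_prob; where to jump only on dest.
    if rng.random() < jump_prob:
        return int(rng.choice(len(dest), p=dest))
    return state

rng = np.random.default_rng(0)
rates = [0.0, 2.0, 1.0, 1.0]  # hypothetical rates out of state 0
print(first_order_ctmc_step(0, rates, h=0.1, rng=rng))
```

In this factorization, any finite-step correction that only rescales `jump_prob` leaves the conditional destination distribution `dest` unchanged, which matches the property described in the abstract; how the corrected probability is actually computed is specified in the paper and is not reproduced here.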

Problem

Research questions and friction points this paper is trying to address.

discrete flow matching

scheduling

path-tracking error

continuous-time Markov chain

zero-shot text-to-speech

Innovation

Methods, ideas, or system contributions that make the work stand out.

kinetic-optimal scheduling

moment correction

discrete flow matching