AI Summary
This work addresses two obstacles to applying metric-induced discrete flow matching (MI-DFM) to zero-shot text-to-speech synthesis: heuristic schedulers that require hyperparameter search, and the path-tracking error that the first-order continuous-time Markov chain (CTMC) solver incurs over a finite number of steps. The authors propose a training-free, kinetic-optimal scheduler that traverses a scalar-parameterized probability path at constant Fisher-Rao speed, together with a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. This approach yields the first kinetic-energy-optimal scheduling strategy derived for scalar probability paths. The resulting method, GibbsTTS, operates on codec-based speech representations. In controlled comparisons with a unified architecture it achieves the best objective naturalness and is preferred over masked discrete generative baselines in subjective evaluations; against the evaluated state-of-the-art TTS systems it attains the highest speaker similarity on three of four test sets and ranks second on the fourth.
Abstract
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject
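To make "traversing the path at constant Fisher-Rao speed" concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes a toy scalar-parameterized Gibbs path p_β(x) ∝ exp(−β·d(x)) over a small discrete set, where `d` is a hypothetical vector of distances to a target token and `beta_max` is a hypothetical end value of the parameter. For this family the Fisher information with respect to β is the variance of d under p_β, so the schedule is obtained by integrating the square root of that variance and inverting the resulting arc length. The function names `fisher_rao_speed` and `constant_speed_schedule` are illustrative only.

```python
import numpy as np


def fisher_rao_speed(beta, d):
    """Square root of the Fisher information of p_beta(x) ∝ exp(-beta * d[x]).

    For this one-parameter exponential family the Fisher information
    with respect to beta equals Var_{p_beta}[d].
    """
    logits = -beta * d
    logits = logits - logits.max()  # stabilise the softmax
    p = np.exp(logits)
    p /= p.sum()
    mean = (p * d).sum()
    var = (p * (d - mean) ** 2).sum()
    return np.sqrt(var)


def constant_speed_schedule(d, beta_max, num_steps, grid_size=2001):
    """Return num_steps + 1 values of beta spaced equally in Fisher-Rao arc length.

    Assumption: beta runs from 0 to beta_max along the path.
    """
    betas = np.linspace(0.0, beta_max, grid_size)
    speed = np.array([fisher_rao_speed(b, d) for b in betas])
    # Cumulative arc length via the trapezoidal rule.
    segments = 0.5 * (speed[1:] + speed[:-1]) * np.diff(betas)
    arc = np.concatenate([[0.0], np.cumsum(segments)])
    # Equal arc-length targets, mapped back to beta by linear interpolation.
    targets = np.linspace(0.0, arc[-1], num_steps + 1)
    return np.interp(targets, arc, betas)


# Hypothetical distances from four candidate tokens to the target token.
d = np.array([0.0, 1.0, 2.0, 4.0])
schedule = constant_speed_schedule(d, beta_max=5.0, num_steps=8)
```

In this toy setting the schedule takes small steps in β where the distribution changes quickly (near β = 0) and larger steps once it has concentrated on the target, and it needs no training or hyperparameter search beyond the grid resolution.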