FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

📅 2025-02-16

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This paper addresses temporal inconsistency and generation instability in autoregressive text-to-speech (TTS) caused by modeling continuous-valued mel-spectrogram tokens. It proposes the first end-to-end TTS framework that jointly leverages autoregressive language modeling and token-wise flow matching. The method introduces: (1) a token-level coarse-to-fine flow matching mechanism to explicitly capture temporal dependencies in continuous spectrogram sequences; and (2) a dynamically updated flow matching prior that incorporates historical generation states to enhance temporal coherence. By integrating hierarchical conditional modeling with continuous-token flow matching, the approach achieves significant improvements in naturalness and prosodic quality on benchmarks including VCTK and LibriTTS. Comprehensive objective and subjective evaluations demonstrate superior performance over state-of-the-art autoregressive and non-autoregressive TTS systems.


📝 Abstract

To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
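The idea of a flow-matching prior informed by the previous step can be illustrated with a minimal sketch. It is not FELLE's implementation: the Gaussian prior centered on the previous mel frame, the straight-line probability path, the fixed-step Euler sampler, and all names (`sample_prior`, `interpolate`, `target_velocity`, `euler_sample`, `sigma`) are assumptions chosen for illustration, and an oracle velocity stands in for a trained network.

```python
import numpy as np

def sample_prior(prev_frame, sigma, rng):
    """Draw x0 from a Gaussian centered on the previous mel frame.

    Assumption: the abstract only says the prior incorporates information
    from the previous step; this particular form is hypothetical.
    """
    return prev_frame + sigma * rng.standard_normal(prev_frame.shape)

def interpolate(x0, x1, t):
    """Straight-line path between prior sample and data: (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight-line path."""
    return x1 - x0

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
n_mels = 80  # hypothetical mel dimension
prev_frame = rng.standard_normal(n_mels)
target_frame = prev_frame + 0.1 * rng.standard_normal(n_mels)

# Start the flow near the previous frame instead of at a standard normal.
x0 = sample_prior(prev_frame, sigma=0.5, rng=rng)

# Oracle velocity standing in for a trained, LM-conditioned network.
def oracle(x, t):
    return target_velocity(x0, target_frame)

generated = euler_sample(x0, oracle, num_steps=8)
```

In a trained system the oracle would be replaced by a network that regresses `target_velocity(x0, x1)` at points `interpolate(x0, x1, t)`, conditioned on the language model's output.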

Problem

Research questions and friction points this paper is trying to address.

Enhancing continuous-valued token modeling

Improving temporal coherence in speech synthesis

Hierarchical flow matching for mel-spectrogram generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive model with flow matching

Coarse-to-fine token generation

Enhanced temporal coherence
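Coarse-to-fine token generation can be pictured as a two-stage process. The sketch below is a hypothetical illustration, not the paper's design: it assumes the coarse view is an average-pooled mel frame and that the fine stage models a residual on top of the upsampled coarse output. The names `downsample`, `upsample`, `euler_flow`, and `coarse_to_fine` are invented here, and oracle velocities stand in for trained networks.

```python
import numpy as np

def downsample(frame, factor):
    """Average-pool a mel frame along frequency (assumed coarse view)."""
    return frame.reshape(-1, factor).mean(axis=1)

def upsample(coarse, factor):
    """Repeat each coarse bin so it aligns with the fine resolution."""
    return np.repeat(coarse, factor)

def euler_flow(x0, velocity_fn, num_steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

def coarse_to_fine(target, factor, rng):
    """Generate a frame in two stages using oracle velocities."""
    # Stage 1: flow from noise to the low-resolution frame.
    coarse_target = downsample(target, factor)
    c0 = rng.standard_normal(coarse_target.shape)
    coarse = euler_flow(c0, lambda x, t: coarse_target - c0)

    # Stage 2: flow from noise to the residual left after upsampling
    # the coarse result, then add the two together.
    cond = upsample(coarse, factor)
    residual_target = target - cond
    r0 = rng.standard_normal(target.shape)
    residual = euler_flow(r0, lambda x, t: residual_target - r0)
    return coarse, cond + residual

rng = np.random.default_rng(0)
target = rng.standard_normal(80)  # hypothetical 80-bin mel frame
coarse, fine = coarse_to_fine(target, factor=4, rng=rng)
```

The residual formulation is only one way to make the fine stage depend on the coarse one; a learned model could instead take the coarse output as a conditioning input.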

🔎 Similar Papers

Autoregressive Speech Synthesis without Vector Quantization