FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

📅 2025-02-16
🤖 AI Summary
This paper addresses temporal inconsistency and generation instability in autoregressive text-to-speech (TTS) that models continuous-valued mel-spectrogram tokens. It proposes FELLE, described as the first end-to-end TTS framework that jointly leverages autoregressive language modeling and token-wise flow matching. The method introduces: (1) a flow-matching prior that is updated for each token with information from the previous generation step, to improve temporal coherence and stability; and (2) a token-wise coarse-to-fine flow-matching mechanism that generates each continuous-valued token hierarchically, conditioned on the language model's output, to improve synthesis quality. By integrating hierarchical conditional modeling with continuous-token flow matching, the approach achieves significant improvements in naturalness and prosodic quality on benchmarks including VCTK and LibriTTS. Comprehensive objective and subjective evaluations demonstrate superior performance over state-of-the-art autoregressive and non-autoregressive TTS systems.

📝 Abstract
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
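The idea of replacing the general flow-matching prior with one informed by the previous step can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the prior is taken to be a Gaussian centered on the previous mel frame, the probability path is a straight line from the prior sample to the target frame, and the names `sample_prior`, `flow_matching_pair`, `euler_sample`, and `sigma` are hypothetical.

```python
import numpy as np

def sample_prior(prev_frame, sigma, rng):
    """Draw x0 from N(prev_frame, sigma^2 I) rather than the usual N(0, I).
    Assumption: the abstract does not specify the exact form of the prior."""
    return prev_frame + sigma * rng.standard_normal(prev_frame.shape)

def flow_matching_pair(x0, x1, t):
    """Return the point x_t on the straight path from x0 to x1 and the
    constant velocity u_t = x1 - x0 that a network would regress onto."""
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

def euler_sample(velocity_fn, x0, n_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
prev_frame = rng.standard_normal(80)                  # previous mel frame (80 bins, hypothetical)
target = prev_frame + 0.1 * rng.standard_normal(80)   # next frame, close to the previous one

x0 = sample_prior(prev_frame, sigma=0.5, rng=rng)
t = 0.3
x_t, u_t = flow_matching_pair(x0, target, t)

# With the exact (oracle) velocity, Euler integration lands on the target frame.
generated = euler_sample(lambda x, s: u_t, x0)
```

In a trained system the oracle velocity would be replaced by a learned network conditioned on the language model's output. The sketch only shows why a prior centered near the previous frame shortens the path the flow has to cover when consecutive frames are similar.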
Problem

Research questions and friction points this paper is trying to address.

Enhancing continuous-valued token modeling
Improving temporal coherence in speech synthesis
Hierarchical flow matching for mel-spectrogram generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive model with flow matching
Coarse-to-fine token generation
Enhanced temporal coherence
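
Coarse-to-fine generation of a single continuous token can be sketched as a two-stage decomposition. The sketch below is an assumption made for illustration: the coarse component is taken to be the frame average-pooled along the frequency axis, and the fine component is the residual left after upsampling the coarse part. The summary above does not state how the hierarchy is built, and the names `split_coarse_fine`, `merge_coarse_fine`, `generate_frame`, and `factor` are hypothetical.

```python
import numpy as np

def split_coarse_fine(frame, factor=2):
    """Split a mel frame into a low-resolution coarse part (average pooling
    over groups of `factor` bins) and the fine residual at full resolution."""
    coarse = frame.reshape(-1, factor).mean(axis=1)
    fine = frame - np.repeat(coarse, factor)
    return coarse, fine

def merge_coarse_fine(coarse, fine, factor=2):
    """Invert split_coarse_fine: upsample the coarse part and add the residual."""
    return np.repeat(coarse, factor) + fine

def generate_frame(lm_hidden, coarse_sampler, fine_sampler, factor=2):
    """Two-stage generation: sample the coarse part from the language-model
    state, then sample the fine part conditioned on that state and the coarse part."""
    coarse = coarse_sampler(lm_hidden)
    fine = fine_sampler(lm_hidden, coarse)
    return merge_coarse_fine(coarse, fine, factor)

frame = np.arange(8, dtype=float)            # toy 8-bin "mel frame"
coarse, fine = split_coarse_fine(frame)
restored = merge_coarse_fine(coarse, fine)

# Stub samplers that return the ground-truth parts, standing in for
# flow-matching samplers conditioned on the language model's output.
lm_hidden = np.zeros(4)
generated = generate_frame(lm_hidden, lambda h: coarse, lambda h, c: fine)
```

The decomposition is lossless, so the two stages together can represent any frame; the design choice is only about the order in which information is generated.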