🤖 AI Summary
This paper addresses temporal inconsistency and generation instability in autoregressive text-to-speech (TTS) caused by modeling continuous-valued mel-spectrogram tokens. We propose the first end-to-end TTS framework that jointly leverages autoregressive language modeling and token-wise flow matching. Our method introduces: (1) a token-level coarse-to-fine flow matching mechanism to explicitly capture temporal dependencies in continuous spectrogram sequences; and (2) a dynamically updated flow matching prior that incorporates historical generation states to enhance temporal coherence. By integrating hierarchical conditional modeling with continuous-token flow matching, our approach achieves significant improvements in naturalness and prosodic quality on benchmarks including VCTK and LibriTTS. Comprehensive objective and subjective evaluations demonstrate superior performance over state-of-the-art autoregressive and non-autoregressive TTS systems.
📝 Abstract
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
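To make the previous-step prior concrete, the following is a minimal NumPy sketch of token-wise flow matching for a single mel-spectrogram frame. It is an illustration under stated assumptions, not FELLE's implementation: the abstract does not specify the form of the prior, so a Gaussian centred on the previous frame is assumed here, along with a linear interpolation path and a fixed-step Euler solver. All function names, the `sigma` parameter, and the `oracle_velocity` stand-in for the trained network are hypothetical.

```python
import numpy as np


def prior_sample(prev_frame, sigma, rng):
    """Draw x0 from a Gaussian centred on the previous mel frame.

    Assumption: the abstract only says the prior incorporates information
    from the previous step; this particular form is illustrative.
    """
    return prev_frame + sigma * rng.standard_normal(prev_frame.shape)


def fm_training_pair(x1, prev_frame, sigma, rng):
    """Build one flow-matching regression example for a target frame x1."""
    x0 = prior_sample(prev_frame, sigma, rng)
    t = rng.uniform()                 # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1     # point on the straight path x0 -> x1
    target_v = x1 - x0                # constant velocity along that path
    return x_t, t, target_v


def euler_sample(velocity_fn, prev_frame, cond, sigma, rng, steps=8):
    """Generate one frame by integrating dx/dt = v(x, t, cond) from t=0 to t=1."""
    x = prior_sample(prev_frame, sigma, rng)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


def oracle_velocity(x, t, cond):
    """Stand-in for a trained network: points straight at the target frame.

    Here `cond` is the target frame itself; in a real model it would be the
    language model's output for the current step.
    """
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
prev = rng.standard_normal(80)        # previous frame, 80 mel bins (assumed size)
target = rng.standard_normal(80)      # frame to be generated
x_t, t, v = fm_training_pair(target, prev, sigma=0.5, rng=rng)
generated = euler_sample(oracle_velocity, prev, target, sigma=0.5, rng=rng)
```

In an autoregressive loop, `generated` would become `prev_frame` for the next step, which is how the prior would carry information forward from frame to frame. The coarse-to-fine mechanism described in the abstract is not modeled in this sketch.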