🤖 AI Summary
Zero-shot text-to-speech (TTS) remains challenged by unreliable synthesis, high inference latency, substantial computational overhead, and limited temporal diversity, which constrains naturalness. To address these issues, the authors propose Flamed-TTS, an efficient zero-shot TTS framework built on a reformulated flow-matching training paradigm. The approach jointly models discrete (e.g., codec tokens) and continuous speech representations corresponding to different attributes of speech, explicitly enriching temporal dynamics. This design enables both low-latency inference and high-fidelity prosody modeling. Experiments show that the method outperforms state-of-the-art baselines in intelligibility (best WER of 4%), naturalness, and speaker similarity, achieving high-quality speech generation and dynamic pacing in a unified framework.
📝 Abstract
Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity, prosody, and other traits without extensive speaker-specific data. Although recent approaches incorporating language models, diffusion, and flow matching have proven effective for zero-shot TTS, they still encounter challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, along with slow inference and substantial computational overhead. Moreover, temporal diversity, which is crucial for enhancing the naturalness of synthesized speech, remains largely underexplored. To address these challenges, we propose Flamed-TTS, a novel zero-shot TTS framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. To achieve this, we reformulate the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results demonstrate that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace. Notably, Flamed-TTS achieves the best WER (4%) among the leading zero-shot TTS baselines, while maintaining low latency in inference and high fidelity in generated speech. Code and audio samples are available at our demo page https://flamed-tts.github.io.
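The abstract states that Flamed-TTS reformulates the flow-matching training paradigm but does not spell out the objective. For orientation, here is a minimal sketch of a *standard* conditional flow-matching training step, not the paper's actual formulation: the model regresses a velocity field along a straight-line path between Gaussian noise and a data sample. The function name `cfm_training_loss` and the zero-velocity placeholder model are illustrative assumptions.

```python
import numpy as np

def cfm_training_loss(x1, model, rng):
    """One conditional flow-matching training step (generic sketch,
    not Flamed-TTS's exact objective). x1: target features, shape (B, D)."""
    B, D = x1.shape
    t = rng.uniform(size=(B, 1))            # random time in [0, 1] per sample
    x0 = rng.standard_normal((B, D))        # Gaussian noise sample
    xt = (1.0 - t) * x0 + t * x1            # point on the straight-line path
    v_target = x1 - x0                      # target velocity along that path
    v_pred = model(xt, t)                   # predicted velocity field
    return np.mean((v_pred - v_target) ** 2)  # MSE regression loss

# Toy usage with a placeholder model that always predicts zero velocity.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16))     # stand-in for speech features
loss = cfm_training_loss(features, lambda xt, t: np.zeros_like(xt), rng)
```

In practice the model would also condition on text and the speaker prompt, and the regressed velocity field is integrated at inference time with an ODE solver; the number of solver steps is the main lever for the low-latency behavior the abstract highlights.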