RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses word skipping and repetition in flow-matching-based text-to-speech (TTS) synthesis, errors that arise from imperfect alignment. The authors propose length-preserving latent-space augmentations that explicitly model skip and repeat failure modes without requiring external aligners or preference data. The augmentations are integrated into a contrastive flow-matching framework, which represents the first application of augmentation-based contrastive flow matching to TTS content fidelity, and the method integrates readily into existing zero-shot TTS pipelines. On Seed-TTS-eval, the word error rate (WER) decreases from 1.44% to 1.38% with a 0.06B-parameter model; on the ZERO500 benchmark, at 24 function evaluations (NFE = 24), the character error rate (CER) drops from 0.48% to 0.35% for English and from 0.81% to 0.57% for Korean.
πŸ“ Abstract
While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
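
The abstract does not spell out the augmentations or the loss, so the following is only a rough sketch of how length-preserving repeat and skip augmentations could feed a contrastive flow-matching objective. Everything here is an assumption for illustration rather than the authors' implementation: the function names (`repeat_augment`, `skip_augment`, `contrastive_fm_loss`), the padding of skipped sequences with the last frame, the linear path with target velocity x1 - x0, the reuse of the same noise sample for the negative target, and the weight `lam`.

```python
import numpy as np

def repeat_augment(z, start, span):
    """Hypothetical length-preserving 'repeat' augmentation.

    Frames [start, start + span) appear twice in a row; the tail is
    truncated so the output still has T frames. Assumes start + span <= T.
    """
    T = z.shape[0]
    idx = np.concatenate([np.arange(0, start + span), np.arange(start, T)])[:T]
    return z[idx]

def skip_augment(z, start, span):
    """Hypothetical length-preserving 'skip' augmentation.

    Frames [start, start + span) are dropped; the last frame is repeated
    to pad the sequence back to T frames (a padding choice assumed here).
    """
    T = z.shape[0]
    kept = np.concatenate([np.arange(0, start), np.arange(start + span, T)])
    pad = np.full(T - kept.shape[0], T - 1)
    return z[np.concatenate([kept, pad])]

def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Sketch of a contrastive flow-matching loss.

    Pulls the predicted velocity toward the clean target (x1 - x0) and
    pushes it away from the velocity toward the augmented latent
    (x1_neg - x0). `lam` is an assumed weighting, not a value from the paper.
    """
    pos = np.mean((v_pred - (x1 - x0)) ** 2)
    neg = np.mean((v_pred - (x1_neg - x0)) ** 2)
    return pos - lam * neg

# Toy example: 6 latent frames of dimension 1.
rng = np.random.default_rng(0)
x1 = np.arange(6, dtype=float)[:, None]      # clean latent sequence
x0 = rng.standard_normal(x1.shape)           # noise sample
x1_rep = repeat_augment(x1, start=2, span=2)  # simulated repeat error
x1_skp = skip_augment(x1, start=2, span=2)    # simulated skip error

# A perfect velocity prediction gives zero positive term, so the loss
# is negative whenever the augmented latent differs from the clean one.
v_perfect = x1 - x0
loss_rep = contrastive_fm_loss(v_perfect, x0, x1, x1_rep)
loss_skp = contrastive_fm_loss(v_perfect, x0, x1, x1_skp)
```

In this sketch both augmented sequences keep the original number of frames, so they can be batched with the clean latent and conditioned on the same text, which is one plausible reading of "length-preserving" in the abstract.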
Problem

Research questions and friction points this paper is trying to address.

text-to-speech
content fidelity
skip errors
repeat errors
alignment robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
contrastive learning
latent augmentation
text-to-speech
alignment robustness