AI Summary
This work addresses word skipping and repetition in flow-matching-based text-to-speech (TTS) synthesis, errors that arise from imperfect alignment. The authors propose RobustSpeechFlow, a training strategy built on length-preserving latent-space augmentations that explicitly model repeat and skip failure modes, without requiring external aligners or preference data. The augmentations are integrated into a contrastive flow-matching framework, which represents the first application of augmentation-based contrastive flow matching to TTS content fidelity, and the method fits into existing zero-shot TTS pipelines. Experiments show consistent improvements: on Seed-TTS-eval, the word error rate (WER) decreases from 1.44% to 1.38% with only 0.06B parameters; on the ZERO500 benchmark at 24 function evaluations (NFE=24), the character error rate (CER) drops from 0.48% to 0.35% for English and from 0.81% to 0.57% for Korean.
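To put the reported absolute numbers in perspective, the short sketch below computes the relative error reduction for each benchmark. The error rates are taken from the summary above; the helper name `relative_reduction` is a hypothetical one introduced only for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative error reduction in percent."""
    return (before - after) / before * 100.0


# (baseline, proposed) error rates in percent, as reported in the summary.
results = {
    "Seed-TTS-eval WER": (1.44, 1.38),
    "ZERO500 English CER": (0.48, 0.35),
    "ZERO500 Korean CER": (0.81, 0.57),
}

for name, (before, after) in results.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% relative reduction")
```

Under this calculation the Seed-TTS-eval gain is roughly 4% relative, while the ZERO500 gains are roughly 27% (English) and 30% (Korean) relative.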
Abstract
While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
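The abstract does not spell out how the augmentations or the loss are implemented, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a latent is a sequence of frames (scalars here, for brevity), that a "repeat" negative duplicates a span and truncates the tail, that a "skip" negative drops a span and pads by repeating the last frame, and that the contrastive term subtracts a weighted distance to the negative's velocity target, in the general style of contrastive flow matching. All function names, the padding choice, the reuse of the same noise sample for the negative, and the weight `lam` are hypothetical.

```python
def repeat_augment(frames, start, span):
    """Duplicate frames[start:start+span], then truncate to the original length."""
    n = len(frames)
    assert 0 <= start and 0 < span and start + span <= n
    segment = frames[start:start + span]
    out = frames[:start + span] + segment + frames[start + span:]
    return out[:n]  # assumption: excess frames are cut from the tail


def skip_augment(frames, start, span):
    """Drop frames[start:start+span], then pad the tail to the original length."""
    n = len(frames)
    assert 0 <= start and 0 < span < n and start + span <= n
    out = frames[:start] + frames[start + span:]
    return out + [out[-1]] * (n - len(out))  # assumption: pad with the last frame


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Pull v_pred toward the clean target, push it away from the corrupted one.

    Assumes linear interpolation paths, so the velocity target is x1 - x0.
    """
    v_pos = [b - a for a, b in zip(x0, x1)]
    v_neg = [b - a for a, b in zip(x0, x1_neg)]
    return mse(v_pred, v_pos) - lam * mse(v_pred, v_neg)


latent = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.0] * len(latent)

repeated = repeat_augment(latent, start=1, span=2)  # [0.0, 1.0, 2.0, 1.0, 2.0, 3.0]
skipped = skip_augment(latent, start=1, span=2)     # [0.0, 3.0, 4.0, 5.0, 5.0, 5.0]

# A perfect prediction of the clean velocity yields a negative loss,
# because it already differs from the skip-corrupted target.
perfect_pred = [b - a for a, b in zip(noise, latent)]
loss = contrastive_fm_loss(perfect_pred, noise, latent, skipped)
```

Both augmentations return a sequence of the same length as the input, which is what lets the negative share the conditioning and shape of the clean sample without an external aligner.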