Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track

📅 2025-12-19
🤖 AI Summary
To address label noise and environmental diversity in real-world speech within the WildSpoof 2026 Text-to-Speech (TTS) Track, this paper proposes the Self-Purifying Flow Matching (SPFM) framework. SPFM introduces, for the first time in TTS flow matching, an explicit noisy-sample routing mechanism: it dynamically identifies suspicious text-speech pairs and routes them to unconditional flow matching training, so their acoustic information is still used while their unreliable text labels are not. Integrated with the open-weight Supertonic model, SPFM jointly optimizes conditional and unconditional flow matching losses, employs lightweight fine-tuning, and adopts dynamic sample weighting. In the reported experiments, SPFM achieves the lowest Word Error Rate (WER) on the WildSpoof TTS Track and the second-best perceptual scores on both UTMOS and DNSMOS, indicating improved robustness to label noise and better generalization across diverse acoustic conditions.
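One way to write down the routing idea described above is sketched below. This is an illustrative formulation, not the paper's stated objective: it assumes a linear interpolation path between noise and data, a null condition ∅ for the unconditional branch, and a hard per-sample comparison rule. The symbols v_θ, x_0, x_1, c, and r_i are introduced here for illustration only, and the dynamic sample weighting mentioned in the summary is not modeled.

```latex
\begin{aligned}
x_t^{(i)} &= (1 - t)\,x_0^{(i)} + t\,x_1^{(i)},
  \qquad x_0^{(i)} \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}[0, 1] \\
\ell_{\mathrm{cond}}^{(i)} &=
  \bigl\| v_\theta\bigl(x_t^{(i)}, t, c^{(i)}\bigr) - \bigl(x_1^{(i)} - x_0^{(i)}\bigr) \bigr\|^2 \\
\ell_{\mathrm{uncond}}^{(i)} &=
  \bigl\| v_\theta\bigl(x_t^{(i)}, t, \varnothing\bigr) - \bigl(x_1^{(i)} - x_0^{(i)}\bigr) \bigr\|^2 \\
r_i &= \mathbb{1}\bigl[\ell_{\mathrm{cond}}^{(i)} > \ell_{\mathrm{uncond}}^{(i)}\bigr] \\
\mathcal{L} &= \frac{1}{N} \sum_{i=1}^{N}
  \Bigl[ (1 - r_i)\,\ell_{\mathrm{cond}}^{(i)} + r_i\,\ell_{\mathrm{uncond}}^{(i)} \Bigr]
\end{aligned}
```

Here r_i = 1 marks a pair whose text does not help the model predict the velocity, so the sample contributes only through the unconditional term.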

📝 Abstract
This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text-speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
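The per-sample comparison described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear noise-to-data path, the rule "route when the conditional loss exceeds the unconditional loss", and every name below (`fm_loss`, `spfm_batch_loss`, the model signature, passing `None` as the null condition) are hypothetical and are not taken from the paper or from the Supertonic codebase.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical model signature: (x_t, t, text or None) -> predicted velocity.
VelocityFn = Callable[[Vector, float, Optional[str]], Vector]


def fm_loss(model: VelocityFn, x0: Vector, x1: Vector, t: float,
            cond: Optional[str]) -> float:
    """Mean squared flow matching error on a linear path from x0 (noise) to x1 (data)."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]  # velocity of the linear path
    pred = model(x_t, t, cond)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def spfm_batch_loss(
    model: VelocityFn,
    batch: Sequence[Tuple[str, Vector]],
    sample_noise: Optional[Callable[[int], Vector]] = None,
    sample_t: Optional[Callable[[], float]] = None,
) -> Tuple[float, List[bool]]:
    """Return (mean training loss, per-sample 'routed to unconditional' flags).

    Assumed rule: a pair is suspicious when conditioning on its text gives a
    higher loss than the null condition (None) on the same noise and time.
    """
    if sample_noise is None:
        sample_noise = lambda n: [random.gauss(0.0, 1.0) for _ in range(n)]
    if sample_t is None:
        sample_t = random.random

    total, routed = 0.0, []
    for text, x1 in batch:
        x0, t = sample_noise(len(x1)), sample_t()
        loss_cond = fm_loss(model, x0, x1, t, text)
        loss_uncond = fm_loss(model, x0, x1, t, None)
        suspicious = loss_cond > loss_uncond
        # Suspicious pairs still contribute acoustically via the unconditional loss.
        total += loss_uncond if suspicious else loss_cond
        routed.append(suspicious)
    return total / len(batch), routed
```

In a real training loop the two losses would come from a neural network and the routing decision would typically be computed without gradients; the injectable `sample_noise` and `sample_t` arguments exist only to make the sketch deterministic when needed.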
Problem

Research questions and friction points this paper is trying to address.

Robust TTS adaptation to in-the-wild speech
Mitigate label noise in text-speech pairs
Enhance speech synthesis for diverse real-world conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Purifying Flow Matching for noise mitigation
Fine-tuning open-weight Supertonic TTS model
Routing suspicious pairs to unconditional training