🤖 AI Summary
To curb the escalating storage and annotation costs of rapidly growing TTS datasets, this paper proposes the first active learning framework for corpus construction in text-to-speech synthesis. Departing from conventional static, model-agnostic data collection, the approach forms a closed-loop "sampling–modeling–feedback" pipeline: it selects high-informativeness text–speech pairs by jointly scoring model uncertainty and sample diversity, then incrementally retrains the model on the enlarged corpus. Experiments show that corpora built this way significantly improve synthesized-speech naturalness, with MOS gains of +0.3 to +0.5 at matched data scale, and match full-data baseline performance using only 60% of the data, making data utilization for TTS development substantially more efficient.
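The paper does not include an implementation, but the loop it describes is straightforward to sketch. Below is a minimal, self-contained Python/NumPy illustration of greedy uncertainty-plus-diversity batch selection inside such a sampling–modeling–feedback loop; the function name `select_batch`, the weighted scoring rule, the `alpha` trade-off, and the random toy embeddings are all illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def select_batch(uncertainty, embeddings, selected, k, alpha=0.5):
    """Greedily pick k candidates by a weighted uncertainty + diversity score.

    uncertainty : (N,) informativeness of each candidate (higher = more useful)
    embeddings  : (N, D) candidate text-speech pair representations
    selected    : list of (D,) embeddings already in the corpus
    alpha       : trade-off between uncertainty and diversity
    """
    pool = list(range(len(uncertainty)))
    chosen, selected = [], list(selected)
    for _ in range(k):
        if selected:
            sel = np.stack(selected)                      # (M, D)
            # Diversity: distance from each candidate to its nearest
            # already-selected sample (larger = more novel).
            div = np.linalg.norm(
                embeddings[pool][:, None, :] - sel[None, :, :], axis=-1
            ).min(axis=1)
            div = div / (div.max() + 1e-9)                # normalize to [0, 1]
        else:
            div = np.ones(len(pool))                      # first pick: all equally novel
        score = alpha * uncertainty[pool] + (1.0 - alpha) * div
        best = pool[int(np.argmax(score))]
        chosen.append(best)
        selected.append(embeddings[best])
        pool.remove(best)
    return chosen

# Toy driver standing in for the closed-loop "sampling-modeling-feedback" pipeline.
rng = np.random.default_rng(0)
pool_emb = rng.normal(size=(500, 32))   # stand-in for text-speech pair embeddings
corpus_emb = []

for rnd in range(3):
    # In a real system these scores come from the current TTS model,
    # e.g. per-utterance reconstruction loss or predictive entropy.
    uncertainty = rng.random(len(pool_emb))
    idx = select_batch(uncertainty, pool_emb, corpus_emb, k=50)
    corpus_emb.extend(pool_emb[i] for i in idx)
    pool_emb = np.delete(pool_emb, idx, axis=0)
    # ... incrementally retrain / fine-tune the TTS model on the grown corpus here ...
    print(f"round {rnd}: corpus size = {len(corpus_emb)}")
```

In a real pipeline, the uncertainty scores would be produced by the current TTS model rather than drawn at random, and each round would end with incremental fine-tuning on the grown corpus before the next selection step.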
📝 Abstract
The construction of high-quality datasets is a cornerstone of modern text-to-speech (TTS) systems. However, the increasing scale of available data poses significant challenges, including storage constraints. To address these issues, we propose a TTS corpus construction method based on active learning. Unlike traditional corpus construction approaches, which are feed-forward (data are collected once, before any training) and model-agnostic, our method iteratively alternates between data collection and model training, focusing acquisition on the data that is most informative for model improvement. This yields a data-efficient corpus. Experimental results demonstrate that a corpus constructed with our method supports higher-quality speech synthesis than conventionally constructed corpora of the same size.