🤖 AI Summary
This paper addresses the challenge of cleaning million-hour-scale multilingual speech data without relying on transcriptions or speaker IDs. The proposed method freezes a pre-trained Universal Speech Model (USM) as a conditioning-free feature extractor and introduces lightweight parallel adapters jointly optimized with the WaveFit neural vocoder. This design generalizes across 300+ languages while adding minimal memory overhead. Trained end-to-end on 3,000 hours of degraded multilingual speech, the system achieves state-of-the-art or competitive performance in word error rate, speaker similarity, and objective/subjective speech quality. It attains a real-time factor of 0.0078 on a single consumer-grade accelerator and can process one million hours of speech in roughly three days using 100 such accelerators, substantially reducing the cost of purifying training data for large-scale generative speech models.
📝 Abstract
Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour-scale data, to clean training data for large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multilingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2's superior or comparable performance to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078 and enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
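The throughput claim can be sanity-checked from the numbers in the abstract. A minimal sketch, assuming the stated real-time factor of 0.0078 (compute hours per hour of audio on one accelerator) and perfect parallelism across 100 devices:

```python
# Sanity check of the processing-time claim from the stated figures.
DATASET_HOURS = 1_000_000    # target corpus size, in hours of speech
RTF = 0.0078                 # real-time factor: compute hours per audio hour
NUM_ACCELERATORS = 100       # devices assumed to run in parallel

compute_hours = DATASET_HOURS * RTF                   # total single-device work
wall_clock_hours = compute_hours / NUM_ACCELERATORS   # ideal parallel speedup
wall_clock_days = wall_clock_hours / 24

print(f"{compute_hours:.0f} device-hours total; "
      f"{wall_clock_days:.2f} days on {NUM_ACCELERATORS} accelerators")
# → 7800 device-hours total; 3.25 days on 100 accelerators
```

This matches the paper's "approximately three days" figure; real-world scheduling and I/O overhead would push the wall-clock time somewhat higher.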