🤖 AI Summary
This paper addresses the challenge of cleaning million-hour-scale multilingual speech data without relying on transcriptions or speaker IDs. The proposed method freezes a pre-trained Universal Speech Model (USM) as a conditioning-free feature extractor and introduces lightweight parallel adapters jointly optimized with the WaveFit neural vocoder. This design generalizes across 300+ languages while adding minimal memory overhead. Trained end-to-end on 3,000 hours of degraded multilingual speech, the system achieves state-of-the-art or competitive performance in word error rate, speaker similarity, and objective/subjective speech quality. It attains a real-time factor of 0.0078 on a single consumer-grade accelerator and can process one million hours of speech in roughly three days using 100 such accelerators, substantially reducing the cost of purifying training data for large-scale generative speech models.
📝 Abstract
Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour-scale data, to clean training data for large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multilingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2's superior or comparable performance to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078 and enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
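The throughput claim can be sanity-checked from the numbers in the abstract. A minimal sketch, assuming the stated real-time factor of 0.0078 (compute hours per hour of audio on one accelerator) and perfect parallelism across 100 devices:

```python
# Sanity check of the processing-time claim from the stated figures.
DATASET_HOURS = 1_000_000    # target corpus size, in hours of speech
RTF = 0.0078                 # real-time factor: compute hours per audio hour
NUM_ACCELERATORS = 100       # devices assumed to run in parallel

compute_hours = DATASET_HOURS * RTF                   # total single-device work
wall_clock_hours = compute_hours / NUM_ACCELERATORS   # ideal parallel speedup
wall_clock_days = wall_clock_hours / 24

print(f"{compute_hours:.0f} device-hours total; "
      f"{wall_clock_days:.2f} days on {NUM_ACCELERATORS} accelerators")
# → 7800 device-hours total; 3.25 days on 100 accelerators
```

This matches the paper's "approximately three days" figure; real-world scheduling and I/O overhead would push the wall-clock time somewhat higher.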