🤖 AI Summary
This work addresses the limited generalization and strong task specificity of existing speech generation models by proposing Metis, a unified foundation model that follows a pre-training and fine-tuning paradigm. Methodologically, Metis uses two discrete speech representations, SSL tokens derived from self-supervised learning features and acoustic tokens quantized directly from waveforms, and is pre-trained via unconditional masked generative modeling on SSL tokens over 300K hours of unlabeled speech. Fine-tuning with task-specific conditions then adapts the model to five diverse speech generation tasks: zero-shot text-to-speech (TTS), voice conversion, target speaker extraction, speech enhancement, and lip-to-speech synthesis. Compared to state-of-the-art task-specific and multi-task models, Metis achieves superior performance across all five tasks, even with fewer than 20M trainable parameters or substantially less training data, demonstrating strong cross-task adaptability. Audio demonstrations are publicly available.
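To make the dual representation concrete: the summary does not specify how the two token streams are extracted, but a common pipeline quantizes SSL encoder features with a learned codebook and takes acoustic tokens from a neural codec. The sketch below illustrates that pipeline under those assumptions; the HuBERT checkpoint, EnCodec codec, and codebook size are illustrative choices, not confirmed components of Metis.

```python
import torch
from transformers import HubertModel          # SSL encoder (assumed choice)
from sklearn.cluster import MiniBatchKMeans   # codebook over SSL features
from encodec import EncodecModel              # neural codec (assumed choice)

# Hypothetical extraction of the two discrete representations named above:
# SSL tokens (quantized self-supervised features) and acoustic tokens
# (codes quantized directly from the waveform by a codec).

ssl_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
kmeans = MiniBatchKMeans(n_clusters=1024)  # assumed fitted offline on SSL features
codec = EncodecModel.encodec_model_24khz()

@torch.no_grad()
def ssl_tokens(wave_16khz: torch.Tensor) -> torch.Tensor:
    """(1, samples) waveform -> (T,) discrete SSL token ids."""
    feats = ssl_encoder(wave_16khz).last_hidden_state.squeeze(0)  # (T, D)
    return torch.as_tensor(kmeans.predict(feats.numpy()))

@torch.no_grad()
def acoustic_tokens(wave_24khz: torch.Tensor) -> torch.Tensor:
    """(1, 1, samples) waveform -> (1, n_codebooks, T) codec token ids."""
    frames = codec.encode(wave_24khz)  # list of (codes, scale) frames
    return torch.cat([codes for codes, _ in frames], dim=-1)
```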
📝 Abstract
We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at https://metis-demo.github.io/.
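Points 2) and 3) amount to masked token modeling: unconditional during pre-training, with a task condition injected at fine-tuning time. The following is a minimal sketch of that training objective under stated assumptions; the architecture, vocabulary size, and mask-ratio schedule are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of masked generative pre-training on discrete SSL tokens.
# Sizes, depth, and mask schedule are assumptions, not Metis specifics;
# positional encoding is omitted for brevity.

VOCAB, MASK_ID, DIM = 1024, 1024, 512  # codebook size; extra id for [MASK]

class MaskedTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, cond=None):
        x = self.embed(tokens)
        if cond is not None:      # task condition, added only at fine-tuning
            x = x + cond          # e.g. text / visual / noisy-speech embedding
        return self.head(self.encoder(x))

def pretrain_step(model, tokens):
    """Unconditional masked prediction: mask a random fraction, predict it."""
    ratio = torch.empty(1).uniform_(0.15, 1.0).item()  # schedule is assumed
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio
    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs)                             # (B, T, VOCAB)
    # Loss only on masked positions, as in masked generative modeling.
    return F.cross_entropy(logits[mask], tokens[mask])
```

At inference, models of this kind typically fill in masked SSL tokens over several iterative decoding steps and then map them to acoustic tokens for waveform reconstruction; fine-tuning mainly has to learn the small conditioning pathway, which is consistent with the abstract's claim of adaptation with fewer than 20M trainable parameters.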