UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech pre-training has long relied on task-specific foundation models, some designed for discriminative tasks (e.g., ASR) and others for generative tasks (e.g., TTS, speech tokenization), which has prevented unified representation learning. This paper introduces UniWav, a unified speech pre-training framework built on a shared encoder-decoder architecture. It jointly learns a representation encoder and a waveform-generation decoder, so a single model acquires both discriminative and generative capabilities. Evaluated on speech recognition, text-to-speech, and speech tokenization, UniWav achieves performance comparable to existing foundation models that were each trained for a specific task. These results suggest that a single general-purpose foundation model can replace multiple specialized models, reducing the overhead and cost of pre-training and deployment.

📝 Abstract
Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves comparable performance to different existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
Problem

Research questions and friction points this paper is trying to address.

Unified pre-training for speech representation and generation tasks
Joint learning of representation encoder and generative audio decoder
Single general-purpose model to replace task-specific foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified pre-training for speech tasks
Encoder-decoder framework for representation and generation
Single model for discriminative and generative tasks
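The core idea above, one shared encoder serving both a discriminative objective and a generative (waveform-reconstruction) objective under a combined loss, can be sketched as follows. This is a toy illustration under assumed names (`encode`, `decode`, `joint_loss`, the weighting `alpha`), not the paper's actual architecture or training objective.

```python
# Hypothetical sketch of joint encoder-decoder pre-training:
# a shared encoder feeds both a discriminative head (masked prediction
# over representations) and a generative decoder (waveform
# reconstruction). The "model" here is a trivial invertible map,
# purely for illustration.

def encode(wave, w=0.5):
    """Toy shared encoder: map waveform samples to representations."""
    return [w * x for x in wave]

def decode(reps, w=0.5):
    """Toy generative decoder: reconstruct the waveform from representations."""
    return [r / w for r in reps]

def masked_prediction_loss(reps, targets, mask):
    """Discriminative objective: predict target representations at masked positions."""
    errs = [(r - t) ** 2 for r, t, m in zip(reps, targets, mask) if m]
    return sum(errs) / max(len(errs), 1)

def reconstruction_loss(wave, recon):
    """Generative objective: mean squared error against the input waveform."""
    return sum((x - y) ** 2 for x, y in zip(wave, recon)) / len(wave)

def joint_loss(wave, targets, mask, alpha=0.5):
    """Single pre-training loss mixing both objectives (alpha is assumed)."""
    reps = encode(wave)
    recon = decode(reps)
    return (alpha * masked_prediction_loss(reps, targets, mask)
            + (1 - alpha) * reconstruction_loss(wave, recon))
```

Because one encoder is optimized for both terms, its representations must remain useful for recognition-style prediction while still carrying enough information for the decoder to regenerate audio, which is the trade-off a unified model has to balance.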