🤖 AI Summary
HuBERT suffers from a pretrain-inference mismatch, i.e., discrepancies between how representations are produced during pre-training and during inference, which limits ASR performance, particularly relative to data2vec. To address this, we propose MS-HuBERT: (1) we identify and systematically mitigate this mismatch; (2) we introduce a Swap data augmentation and a multi-cluster masked prediction loss to improve clustering robustness and model capacity utilization; and (3) we integrate self-supervised contrastive learning while refining the HuBERT architecture. On the LibriSpeech ASR benchmark, MS-HuBERT reduces word error rate by 5% on average across fine-tuning splits, and its pre-trained representations yield substantial gains on speech content understanding tasks. Our core contributions are: (i) explicit identification and mitigation of HuBERT's pretrain-inference mismatch; (ii) multi-cluster masked prediction as a new training objective; and (iii) a more consistent and efficient self-supervised learning objective for speech representation learning.
📝 Abstract
In recent years, self-supervised pre-training methods have gained significant traction for learning high-level information from raw speech. Among these methods, HuBERT has demonstrated state-of-the-art (SOTA) performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in their pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the LibriSpeech ASR benchmark by a 5% margin on average when evaluated on different fine-tuning splits. Additionally, we demonstrate that the embeddings learned during pre-training encode information essential for improving performance on content-based tasks such as ASR.
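To make the Multicluster objective concrete, the following is a minimal sketch, not the paper's implementation: HuBERT computes a cross-entropy loss against discrete cluster targets only at masked frames, and a multi-cluster variant plausibly averages that loss over several independent cluster assignments (e.g., k-means codebooks of different granularities). The function name, shapes, and the simple averaging scheme here are all illustrative assumptions.

```python
import numpy as np

def multicluster_masked_loss(logits_per_cluster, targets_per_cluster, mask):
    """Hypothetical multi-cluster masked prediction loss.

    logits_per_cluster:  list of (T, V_k) arrays, one prediction head
                         per cluster assignment (codebook).
    targets_per_cluster: list of (T,) integer arrays of cluster IDs.
    mask:                (T,) boolean array, True at masked frames.

    Returns the cross-entropy averaged over masked frames and over
    cluster assignments (an assumed aggregation, for illustration).
    """
    total = 0.0
    for logits, targets in zip(logits_per_cluster, targets_per_cluster):
        # Numerically stable log-softmax over the cluster vocabulary.
        z = logits - logits.max(axis=-1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        # Negative log-likelihood of the assigned cluster per frame.
        nll = -log_probs[np.arange(len(targets)), targets]
        # HuBERT-style: only masked frames contribute to the loss.
        total += nll[mask].mean()
    return total / len(logits_per_cluster)
```

The key property this sketch preserves is that each cluster assignment contributes an independent prediction target over the same masked frames, which is one way to push the model to use more of its capacity than a single codebook would.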