🤖 AI Summary
HuBERT suffers from a pretrain-inference mismatch, i.e., discrepancies between how representations are produced during pre-training and during inference, which limits ASR performance, particularly relative to data2vec. To address this, we propose MS-HuBERT: (1) we identify and systematically mitigate this mismatch; (2) we introduce a Swap data augmentation and a multi-cluster masked prediction loss to improve clustering robustness and model capacity utilization; and (3) we integrate self-supervised contrastive learning while refining the HuBERT architecture. On the LibriSpeech ASR benchmark, MS-HuBERT reduces word error rate by 5% on average across fine-tuning splits, and its pre-trained representations yield substantial gains on speech content understanding tasks. Our core contributions are: (i) explicit identification and mitigation of HuBERT's pretrain-inference mismatch; (ii) multi-cluster masked prediction as a new training objective; and (iii) a more consistent and efficient self-supervised learning objective for speech representation learning.
📝 Abstract
In recent years, self-supervised pre-training methods have gained significant traction for learning high-level information from raw speech. Among these methods, HuBERT has demonstrated state-of-the-art (SOTA) performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in their pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the LibriSpeech ASR benchmark by a 5% margin on average when evaluated on different fine-tuning splits. Additionally, we demonstrate that the embeddings learned during pre-training encode information essential for improving performance on content-based tasks such as ASR.
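To make the Multicluster objective concrete, the following is a minimal sketch, not the paper's implementation: HuBERT computes a cross-entropy loss against discrete cluster targets only at masked frames, and a multi-cluster variant plausibly averages that loss over several independent cluster assignments (e.g., k-means codebooks of different granularities). The function name, shapes, and the simple averaging scheme here are all illustrative assumptions.

```python
import numpy as np

def multicluster_masked_loss(logits_per_cluster, targets_per_cluster, mask):
    """Hypothetical multi-cluster masked prediction loss.

    logits_per_cluster:  list of (T, V_k) arrays, one prediction head
                         per cluster assignment (codebook).
    targets_per_cluster: list of (T,) integer arrays of cluster IDs.
    mask:                (T,) boolean array, True at masked frames.

    Returns the cross-entropy averaged over masked frames and over
    cluster assignments (an assumed aggregation, for illustration).
    """
    total = 0.0
    for logits, targets in zip(logits_per_cluster, targets_per_cluster):
        # Numerically stable log-softmax over the cluster vocabulary.
        z = logits - logits.max(axis=-1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        # Negative log-likelihood of the assigned cluster per frame.
        nll = -log_probs[np.arange(len(targets)), targets]
        # HuBERT-style: only masked frames contribute to the loss.
        total += nll[mask].mean()
    return total / len(logits_per_cluster)
```

The key property this sketch preserves is that each cluster assignment contributes an independent prediction target over the same masked frames, which is one way to push the model to use more of its capacity than a single codebook would.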