🤖 AI Summary
This work addresses the limitation of existing self-supervised speech learning methods, which typically support only a single sampling rate and suffer performance degradation when trained on mixed-rate data due to temporal resolution mismatches. To overcome this, we propose MSR-HuBERT, the first framework enabling multi-sampling-rate self-supervised pretraining without resampling. MSR-HuBERT introduces a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms of varying sampling rates—ranging from 16 kHz to 48 kHz—to a unified time resolution while preserving their original structure, thereby maintaining compatibility with HuBERT’s masked prediction objective and Transformer encoder. Experiments demonstrate that MSR-HuBERT outperforms standard HuBERT in both automatic speech recognition and full-band speech reconstruction tasks, effectively retaining high-frequency details and low-frequency semantic structures.
📝 Abstract
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatches. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms at different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 kHz to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT apply directly.
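The core idea of the adaptive front end can be illustrated with a small arithmetic sketch. This is not the authors' code: it assumes the standard HuBERT feature rate of 50 Hz (total CNN stride of 320 at 16 kHz) and simply shows that choosing a per-rate total stride of `sr / 50` makes every input sampling rate land on the same output frame rate, so the Transformer encoder sees a uniform temporal resolution.

```python
# Sketch (assumption, not the paper's implementation): pick a per-rate
# downsampling factor so all sampling rates share one output frame rate.
# HuBERT's conv front end yields 50 Hz frames from 16 kHz audio (stride 320).

TARGET_FRAME_RATE_HZ = 50  # HuBERT's standard feature frame rate

def total_stride(sampling_rate_hz: int) -> int:
    """Total downsampling factor needed to reach the shared frame rate."""
    assert sampling_rate_hz % TARGET_FRAME_RATE_HZ == 0, "rate must divide evenly"
    return sampling_rate_hz // TARGET_FRAME_RATE_HZ

def output_frames(num_samples: int, sampling_rate_hz: int) -> int:
    """Frame count after downsampling (ignoring conv padding/edge effects)."""
    return num_samples // total_stride(sampling_rate_hz)

# One second of audio at any supported rate maps to the same 50 frames:
for sr in (16_000, 24_000, 48_000):
    print(sr, total_stride(sr), output_frames(sr, sr))
```

In a real model, each total stride would be realized by a rate-specific stack of strided 1-D convolutions (e.g. the 48 kHz branch needs 3x the total stride of the 16 kHz branch); the sketch only captures the frame-rate bookkeeping that makes mixed-rate batches compatible with the shared masked-prediction objective.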