MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing self-supervised speech learning methods, which typically support only a single sampling rate and suffer performance degradation when trained on mixed-rate data due to temporal resolution mismatches. To overcome this, we propose MSR-HuBERT, the first framework enabling multi-sampling-rate self-supervised pretraining without resampling. MSR-HuBERT introduces a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms of varying sampling rates—ranging from 16 kHz to 48 kHz—to a unified time resolution while preserving their original structure, thereby maintaining compatibility with HuBERT’s masked prediction objective and Transformer encoder. Experiments demonstrate that MSR-HuBERT outperforms standard HuBERT in both automatic speech recognition and full-band speech reconstruction tasks, effectively retaining high-frequency details and low-frequency semantic structures.

📝 Abstract
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms at different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's masked prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT carry over directly.
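The core idea of mapping different sampling rates to a shared temporal resolution can be illustrated with a minimal sketch. This is not the authors' code: it assumes the front end scales its total downsampling stride in proportion to the input rate (standard HuBERT downsamples 16 kHz audio by a factor of 320, yielding 50 frames per second), so that every supported rate produces the same frame rate without resampling. The exact layer configuration of the adaptive CNN is not specified in the abstract.

```python
# Hypothetical sketch of rate-adaptive downsampling (assumption: the total
# stride scales with the input sampling rate; layer details are not given
# in the abstract).

BASE_RATE = 16_000
BASE_FACTOR = 320  # HuBERT's overall downsampling factor at 16 kHz -> 50 Hz frames


def num_frames(num_samples: int, sample_rate: int) -> int:
    """Output frame count when the total stride is scaled with the rate."""
    # e.g. factor = 320 at 16 kHz, 480 at 24 kHz, 960 at 48 kHz
    factor = BASE_FACTOR * sample_rate // BASE_RATE
    return num_samples // factor


# One second of audio at any supported rate maps to the same 50-frame
# temporal resolution, keeping the Transformer input shape rate-independent.
for rate in (16_000, 24_000, 48_000):
    print(rate, num_frames(rate, rate))
```

Because the frame rate is identical across input rates, the downstream masked prediction objective and Transformer encoder never see which sampling rate a batch came from, which is what makes unified mixed-rate pre-training possible in this design.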
Problem

Research questions and friction points this paper is trying to address.

self-supervised learning
speech processing
multiple sampling rates
temporal resolution mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-sampling-rate adaptation
self-supervised learning
HuBERT
adaptive downsampling CNN
speech representation learning