Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates why HuBERT and wav2vec 2.0 encode linguistic information differently, isolating two factors: the training objective (masked pseudo-label prediction vs. contrastive learning) and iterative pseudo-label refinement across training iterations. Using canonical correlation analysis (CCA), the authors quantify how strongly each hidden layer's representations correlate with word identity, phoneme identity, and speaker identity, complemented by controlled ablation experiments. Results demonstrate that iterative pseudo-label refinement, not the training objective itself, is the primary driver of the models' differing linguistic disentanglement. Moreover, HuBERT's word and phoneme representations strengthen with successive training iterations, particularly in higher layers. These findings highlight the role of "progressive supervision" in self-supervised speech modeling for capturing hierarchical linguistic structure, offering an interpretable perspective for designing linguistically informed speech representations.
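The layer-wise CCA probe described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: `cca_correlations` is a hypothetical helper computing classic CCA via the whitened cross-covariance, and the one-hot "phoneme" labels and random features are toy stand-ins for real hidden-layer activations and frame annotations.

```python
import numpy as np

def cca_correlations(X, Y, reg=1e-6):
    """Canonical correlations between two views, via SVD of the
    whitened cross-covariance (classic CCA, with a small ridge for stability)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # S is symmetric positive definite after regularization
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(K, compute_uv=False)  # sorted descending

# Toy stand-in: 400 "frames" with 16-d features tied to 5 phoneme classes.
rng = np.random.default_rng(0)
phoneme_ids = rng.integers(0, 5, size=400)   # hypothetical frame labels
Y = np.eye(5)[phoneme_ids]                   # one-hot label view
X = Y @ rng.normal(size=(5, 16)) + 0.1 * rng.normal(size=(400, 16))
rho = cca_correlations(X, Y)
layer_score = rho[:4].mean()  # centered one-hot Y has rank 4
```

In the paper's setting, `X` would be one hidden layer's frame-level activations and `Y` the labels (word, phoneme, or speaker identity); repeating this per layer yields the layer-wise correlation profiles being compared across models and training iterations.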

📝 Abstract
Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
Problem

Research questions and friction points this paper is trying to address.

Compare the architectural differences between HuBERT and wav2vec 2.0
Isolate the effect of iterative pseudo-label refinement on the linguistic information encoded in representations
Understand why iterative refinement improves self-supervised speech representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that iterative pseudo-label refinement, not the training objective, distinguishes HuBERT from wav2vec 2.0
Demonstrates that training iteration explains differences in canonical correlation with word, phoneme, and speaker identity
Flags the reason for iterative refinement's effectiveness at encoding linguistic information as a question for future work