🤖 AI Summary
This study investigates how linguistic structure emerges in self-supervised speech models. By analyzing Wav2Vec2 and HuBERT across training stages and network layers, the work systematically characterizes how phonemic, lexical, and syntactic structure is encoded, using intermediate-checkpoint evaluation, probing analyses, and iterative pseudo-label refinement. The findings show that different levels of linguistic structure emerge at distinct times and in distinct layers of the models; that higher-level prediction tasks induce more parallel acquisition of linguistic representations; and that the level of abstraction at which the pretraining objective is defined critically shapes the patterns of structural emergence. This is the first comprehensive account of the hierarchical distribution and learning trajectories of linguistic information in self-supervised speech representation learning.
📝 Abstract
Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
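The probing methodology described above can be sketched as follows: a lightweight linear classifier is fit on frozen hidden states from each layer, and held-out accuracy indicates how strongly that layer encodes a given linguistic property. This is a minimal illustration, not the paper's actual pipeline; the synthetic activations below merely stand in for real Wav2Vec2/HuBERT hidden states, and all variable names are hypothetical.

```python
# Minimal layerwise-probing sketch. Synthetic activations stand in for
# real model hidden states; names and shapes are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_frames, dim, n_layers, n_phones = 600, 32, 4, 5

# Synthetic frame-level "phone" labels.
labels = rng.integers(0, n_phones, size=n_frames)

def probe_accuracy(states, labels):
    """Fit a linear probe on frozen representations; return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        states, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Simulate layers that encode the labels with increasing fidelity,
# mimicking the layerwise emergence of phonemic structure.
accuracies = {}
for layer in range(n_layers):
    signal = np.eye(n_phones)[labels] @ rng.normal(size=(n_phones, dim))
    noise = rng.normal(size=(n_frames, dim))
    strength = layer / (n_layers - 1)  # 0 -> pure noise, 1 -> strong signal
    states = strength * signal + (1 - strength) * noise
    accuracies[layer] = probe_accuracy(states, labels)

print(accuracies)  # accuracy rises with depth in this toy setup
```

In the real setting, the per-layer activations would come from a pretrained model (e.g. requesting all hidden states for an utterance) at each saved checkpoint, yielding one probing curve per layer per training stage.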