🤖 AI Summary
This study investigates how linguistic structure emerges in self-supervised speech models. By analyzing Wav2Vec2 and HuBERT across training stages and network layers, the work systematically characterizes how phonemic, lexical, and syntactic structure is encoded, using intermediate-checkpoint evaluation, probing analyses, and iterative pseudo-label refinement. The findings show that different levels of linguistic structure emerge at distinct times and in distinct layers of the models; that higher-level prediction tasks induce more parallel acquisition of linguistic representations; and that the level of abstraction at which the pretraining objective is defined critically shapes the patterns of structural emergence. This is the first comprehensive account of the hierarchical distribution and learning trajectories of linguistic information in self-supervised speech representation learning.
📝 Abstract
Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
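The probing methodology described above can be sketched as follows: a lightweight linear classifier is fit on frozen hidden states from each layer, and held-out accuracy indicates how strongly that layer encodes a given linguistic property. This is a minimal illustration, not the paper's actual pipeline; the synthetic activations below merely stand in for real Wav2Vec2/HuBERT hidden states, and all variable names are hypothetical.

```python
# Minimal layerwise-probing sketch. Synthetic activations stand in for
# real model hidden states; names and shapes are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_frames, dim, n_layers, n_phones = 600, 32, 4, 5

# Synthetic frame-level "phone" labels.
labels = rng.integers(0, n_phones, size=n_frames)

def probe_accuracy(states, labels):
    """Fit a linear probe on frozen representations; return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        states, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Simulate layers that encode the labels with increasing fidelity,
# mimicking the layerwise emergence of phonemic structure.
accuracies = {}
for layer in range(n_layers):
    signal = np.eye(n_phones)[labels] @ rng.normal(size=(n_phones, dim))
    noise = rng.normal(size=(n_frames, dim))
    strength = layer / (n_layers - 1)  # 0 -> pure noise, 1 -> strong signal
    states = strength * signal + (1 - strength) * noise
    accuracies[layer] = probe_accuracy(states, labels)

print(accuracies)  # accuracy rises with depth in this toy setup
```

In the real setting, the per-layer activations would come from a pretrained model (e.g. requesting all hidden states for an utterance) at each saved checkpoint, yielding one probing curve per layer per training stage.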