🤖 AI Summary
This study investigates how self-supervised speech models (S3Ms) optimized for spoken word recognition in natural noise represent phonological and morphological variation in frequent English noun and verb inflections. Rather than assuming that such models rely on explicit linguistic units, the authors analyze the geometry of the models' internal representations. The results reveal a global linear structure that links base forms to their regular inflected variants (e.g., *walk*/*walks*, *cat*/*cats*). This structure does not directly track phonological or morphological units; it tracks regular distributional relationships among word pairs in the lexicon, which often, but not always, arise from inflection. The findings suggest that word recognition models can capture these lexical regularities without distinct phonological or morphological representations, and they point to candidate representational strategies for human spoken word recognition.
📝 Abstract
Self-supervised speech models (S3Ms) can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms.
This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon -- often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.
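One way to picture a "global linear geometry" linking base forms to inflected forms is a single offset vector shared across word pairs. The sketch below is illustrative only and is not the authors' method: it uses synthetic NumPy vectors in place of real S3M embeddings, and the helper names `fit_offset` and `retrieval_accuracy`, the embedding dimension, and the noise level are all assumptions made for the example. It estimates the shared offset from some pairs and checks whether adding it to held-out base vectors retrieves the matching inflected vectors.

```python
import numpy as np


def fit_offset(base, inflected):
    """Estimate one shared offset as the mean difference between paired vectors."""
    return (inflected - base).mean(axis=0)


def retrieval_accuracy(base, inflected, offset):
    """Fraction of base vectors whose shifted version lands nearest its own pair."""
    predicted = base + offset
    # Pairwise squared Euclidean distances, shape (n_bases, n_inflected).
    dists = ((predicted[:, None, :] - inflected[None, :, :]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == np.arange(len(base))).mean())


# Synthetic stand-ins for word embeddings (assumption: NOT real model output).
rng = np.random.default_rng(0)
dim, n_pairs = 64, 200
base = rng.normal(size=(n_pairs, dim))
true_offset = rng.normal(size=dim)
inflected = base + true_offset + 0.1 * rng.normal(size=(n_pairs, dim))

# Fit the offset on 150 pairs, then evaluate on the 50 held-out pairs.
train, test = slice(0, 150), slice(150, None)
offset_hat = fit_offset(base[train], inflected[train])
accuracy = retrieval_accuracy(base[test], inflected[test], offset_hat)

# Cosine similarity between the estimated and the generating offset.
cosine = float(
    offset_hat @ true_offset
    / (np.linalg.norm(offset_hat) * np.linalg.norm(true_offset))
)
print(f"held-out retrieval accuracy: {accuracy:.2f}, offset cosine: {cosine:.3f}")
```

Because the toy data are generated with a shared offset by construction, high retrieval accuracy here only demonstrates the analysis pattern. With real embeddings, the same test would show whether a single linear direction generalizes to unseen word pairs, including pairs related by something other than inflection.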