Emergent morpho-phonological representations in self-supervised speech models

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how self-supervised speech models (S3Ms) represent phonological and morphological phenomena when they are optimized to recognize spoken words in naturalistic, noisy environments. It analyzes S3M variants trained for word recognition, focusing on frequent English noun and verb inflections. The models' representations exhibit a global linear geometry that links nouns and verbs to their regular inflected forms (e.g., *walk*/*walks*, *cat*/*cats*). This structure does not directly track phonological or morphological units. Instead, it tracks regular distributional relationships that link many word pairs in the English lexicon, often, but not always, because of morphological inflection. The findings point to candidate representational strategies for human spoken word recognition and challenge the presumed necessity of distinct linguistic representations of phonology and morphology.

📝 Abstract

Self-supervised speech models can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon -- often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.
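One way to make the idea of a "global linear geometry" concrete is to ask whether a single shared offset vector carries base-form embeddings onto their inflected counterparts. The sketch below is only an illustration of that kind of test, not the authors' procedure: the embeddings are synthetic stand-ins generated with NumPy, and the function names (`mean_offset`, `offset_consistency`, `retrieval_accuracy`), the 40/10 split, and the cosine-based retrieval are all assumptions made for the example.

```python
import numpy as np


def mean_offset(base, inflected):
    """Average difference vector from base-form to inflected-form embeddings."""
    return (inflected - base).mean(axis=0)


def offset_consistency(base, inflected, offset):
    """Mean cosine similarity between each pair's difference vector and the shared offset."""
    diffs = inflected - base
    cos = (diffs @ offset) / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(offset))
    return float(cos.mean())


def retrieval_accuracy(base, inflected, offset):
    """Fraction of base words whose shifted embedding is nearest (by cosine) to its own inflected form."""
    pred = base + offset
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    cand = inflected / np.linalg.norm(inflected, axis=1, keepdims=True)
    nearest = (pred @ cand.T).argmax(axis=1)
    return float((nearest == np.arange(len(base))).mean())


# Synthetic stand-in data (hypothetical, NOT real model embeddings):
# each "inflected" vector is its base vector plus one shared shift plus small noise.
rng = np.random.default_rng(0)
n_pairs, dim = 50, 64
base = rng.normal(size=(n_pairs, dim))
shared_shift = rng.normal(size=dim)
inflected = base + shared_shift + 0.1 * rng.normal(size=(n_pairs, dim))

# Estimate the offset on 40 pairs and evaluate on the 10 held-out pairs.
train, test = slice(0, 40), slice(40, 50)
offset = mean_offset(base[train], inflected[train])
consistency = offset_consistency(base[test], inflected[test], offset)
accuracy = retrieval_accuracy(base[test], inflected[test], offset)
print(f"held-out offset consistency: {consistency:.3f}")
print(f"held-out retrieval accuracy: {accuracy:.3f}")
```

With real data, `base` and `inflected` would be replaced by word-level embeddings extracted from a speech model, one row per word pair. Evaluating on held-out pairs matters here: an offset that only fits the pairs it was estimated from would not count as evidence of a global structure.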

Problem

Research questions and friction points this paper is trying to address.

It is not yet understood which linguistic representations self-supervised speech models use to recognize spoken words

How S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections

Whether spoken word recognition requires distinct linguistic representations of phonology and morphology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Representations exhibit a global linear geometry that links English nouns and verbs to their regular inflected forms

This geometry tracks regular distributional relationships between word pairs in the lexicon rather than phonological or morphological units

Findings challenge the presumed necessity of distinct linguistic representations of phonology and morphology

🔎 Similar Papers

Sylber: Syllabic Embedding Representation of Speech from Raw Audio