🤖 AI Summary
This study asks how much temporally predictable structure is carried by the acoustic correlates of prosody—intonation, tempo, and loudness—independently of lexical content, and what that structure contributes to language understanding. Method: We propose the Masked Prosody Model (MPM), a self-supervised framework trained to reconstruct masked spans of pitch, energy, and voice activity features, which lets us systematically vary the temporal scale of the self-supervised learning (SSL) objective. Contribution/Results: Probing experiments across a range of perceptual labels show that the learned representations capture local structure (e.g., word boundary detection) but provide the most value for longer-term structure (e.g., emotion recognition), with strong relative gains over untransformed pitch, energy, and voice activity features. The results underscore the importance of the SSL objective's timescale and the value of complex SSL-encoded structure over more constrained classical features.
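The core ingredient of the summary above — masking contiguous spans of prosodic features so the SSL objective's timescale can be varied — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, feature layout, and span-sampling scheme are assumptions.

```python
import numpy as np

def mask_spans(features, span_len, mask_ratio=0.4, mask_value=0.0, rng=None):
    """Mask contiguous spans of a (T, D) prosodic feature sequence.

    `span_len` plays the role of the SSL objective's timescale:
    short spans can be filled in from local context, while long spans
    force a model to rely on longer-range prosodic structure.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T = features.shape[0]
    masked = features.copy()
    mask = np.zeros(T, dtype=bool)
    n_spans = max(int(mask_ratio * T / span_len), 1)
    starts = rng.choice(max(T - span_len, 1), size=n_spans, replace=False)
    for s in starts:
        mask[s:s + span_len] = True
    masked[mask] = mask_value  # a model must reconstruct features[mask]
    return masked, mask

# Toy per-frame prosodic features: pitch contour, energy, voice activity.
T = 200
rng = np.random.default_rng(42)
feats = np.stack([
    100 + 20 * np.sin(np.linspace(0, 6, T)),  # pitch (Hz)
    rng.random(T),                            # energy
    (rng.random(T) > 0.2).astype(float),      # voice activity
], axis=1)

masked, mask = mask_spans(feats, span_len=10, rng=rng)
# An MPM-style training loss would then be, e.g.,
# MSE(model(masked)[mask], feats[mask]).
```

Sweeping `span_len` over short vs. long values is one way to realize the paper's comparison of SSL objective timescales: the same inputs and loss, with only the masked horizon changing.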
📝 Abstract
People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g., intonation, tempo, and loudness, contributes to such structure independently of the lexical content is unclear. This study leverages self-supervised learning (SSL) to examine the temporal granularity of structures in the acoustic correlates of prosody. Representations from our proposed Masked Prosody Model can predict perceptual labels dependent on local information, such as word boundaries, but provide the most value for labels involving longer-term structures, like emotion recognition. Probing experiments across various perceptual labels show strong relative gains over untransformed pitch, energy, and voice activity features. Our results reveal the importance of the SSL training objective's timescale and highlight the value of complex SSL-encoded structures compared to more constrained classical structures.