🤖 AI Summary
This study quantifies the statistical dependence between lexical identity and prosody—particularly pitch—to uncover fundamental differences in how tone, pitch-accent, and stress languages distinguish lexical items.
Method: The study is the first to formalize mutual information as a continuous typological metric. Using a multilingual speech corpus, the approach combines text–pitch-curve alignment, entropy and mutual-information estimation, and statistical modeling.
Contribution/Results: Tone languages exhibit significantly higher lexical–pitch mutual information than pitch-accent or stress languages, while pitch entropy remains comparable across types—indicating that prosodic encoding efficiency stems from pitch’s predictive power over lexical meaning, not its inherent variability. These findings challenge traditional discrete language-type classifications and provide information-theoretic evidence for a gradient prosodic typology. The work advances linguistic typology toward a quantitative, computationally grounded paradigm by establishing mutual information as a scalable, cross-linguistically comparable measure of prosodic–lexical integration.
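The central quantity, the mutual information I(W; P) between word identity W and pitch P, can be illustrated with a simple plug-in estimator over discretized pitch values. This is a toy sketch, not the paper's actual pipeline (which operates on continuous pitch curves aligned to text): the example words and pitch bins below are invented for illustration. In a tonal toy language each word determines its pitch category, so the MI is high; in a non-tonal one pitch varies independently of the word, so the MI is near zero.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(W; P) in bits from (word, pitch_bin) samples."""
    n = len(pairs)
    joint = Counter(pairs)                     # joint counts over (word, pitch_bin)
    words = Counter(w for w, _ in pairs)       # marginal counts over words
    bins = Counter(p for _, p in pairs)        # marginal counts over pitch bins
    mi = 0.0
    for (w, p), c in joint.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((words[w] / n) * (bins[p] / n)))
    return mi

# Hypothetical data: in the "tonal" toy language, word identity fixes the
# pitch bin; in the "atonal" one, pitch bins are shared across words.
tonal = [("ma1", "high"), ("ma2", "rising"), ("ma3", "low"), ("ma1", "high")] * 25
atonal = [("cat", "high"), ("cat", "low"), ("dog", "high"), ("dog", "low")] * 25
```

Here `mutual_information(tonal)` is positive while `mutual_information(atonal)` is essentially zero, mirroring the paper's finding that pitch is more predictable from text in tonal languages even when pitch entropy alone is comparable.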
📝 Abstract
This paper argues that the relationship between lexical identity and prosody -- one well-studied parameter of linguistic variation -- can be characterized using information theory. We predict that languages that use prosody to make lexical distinctions should exhibit higher mutual information between word identity and prosody than languages that do not. We test this hypothesis in the domain of pitch, which is used to make lexical distinctions in tonal languages such as Cantonese. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and the corresponding pitch curves. We find that, across languages, pitch curves display similar amounts of entropy. However, these curves are easier to predict given their associated text in the tonal languages than in pitch- and stress-accent languages, and thus the mutual information is higher in the tonal languages, supporting our hypothesis. Our results support perspectives that view linguistic typology as gradient, rather than categorical.