🤖 AI Summary
This study empirically tests the widely held hypothesis that power-law decay of correlations in natural language is evidence for hierarchical syntactic structure. Method: Using English dependency and phrase-structure trees, we estimate mutual information, measure how far the trees deviate from a fitted PCFG, and apply statistical significance tests. To our knowledge, the argument linking power laws to hierarchy had not previously been tested empirically. Contribution/Results: We find that syntactic trees do not satisfy the assumptions the argument relies on: hierarchical structure by itself does not guarantee power-law decay, and power-law decay does not by itself indicate hierarchy. Extending the argument to other domains, such as child language or animal communication, is therefore difficult to justify. The results challenge the convention of equating power laws with syntactic hierarchy and point to limits on what power laws can explain. The study offers a more careful statistical basis for modeling linguistic hierarchy and calls for the related theoretical frameworks to be reconsidered.
📝 Abstract
Statistical analysis of corpora provides a way to investigate natural languages quantitatively. This approach has revealed several power laws that emerge consistently across corpora and languages, suggesting universal principles underlying language. In particular, the power-law decay of correlations has been interpreted as evidence for underlying hierarchical structures in syntax, semantics, and discourse. This perspective has also been extended to child language and animal signals. However, the argument supporting this interpretation has not been empirically tested. To address this gap, this study examines the validity of the argument for syntactic structures. Specifically, we test whether the statistical properties of parse trees align with the assumptions implicit in the argument. Using English corpora, we analyze mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties of parse trees, as well as of the PCFG that approximates these trees. Our results indicate that the assumptions do not hold for syntactic structures and that the argument is difficult to apply to child language and animal signals, highlighting the need to reconsider the relationship between power laws and hierarchical structures.
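As a rough illustration of the quantity discussed in the abstract, the sketch below estimates the mutual information between two symbols separated by a distance d in a flat sequence, and fits a slope on log-log axes as a crude check for power-law decay. It is not the authors' method: the function names `mutual_information` and `loglog_slope` are hypothetical, the estimator is a simple plug-in estimator (which is biased upward on finite samples), and the paper itself works with parse trees rather than flat symbol sequences.

```python
import math
from collections import Counter


def mutual_information(seq, d):
    """Plug-in estimate, in bits, of I(X_t; X_{t+d}) for a symbol sequence.

    Hypothetical helper for illustration; no bias correction is applied.
    """
    if d < 1:
        raise ValueError("distance d must be at least 1")
    pairs = list(zip(seq, seq[d:]))  # all (symbol, symbol d steps later) pairs
    n = len(pairs)
    if n == 0:
        raise ValueError("sequence too short for this distance")
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    Points with non-positive x or y are skipped. A roughly constant slope
    over a wide range of x is consistent with a power law, but a straight
    line on log-log axes is not by itself a rigorous test.
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive points")
    mean_x = sum(p[0] for p in pts) / len(pts)
    mean_y = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    return sxy / sxx
```

In use, one would compute `mutual_information(tokens, d)` for a range of distances d and pass the distances and estimates to `loglog_slope`. Comparing against an exponential fit, and against shuffled sequences to gauge estimator bias, would be needed before drawing any conclusion about the form of the decay.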