🤖 AI Summary
Does natural language exhibit cross-scale periodicity in information density? This paper addresses the question by proposing AutoPeriod of Surprisal (APS), the first method to systematically detect statistically significant periodic patterns in word-level surprisal sequences within single documents. APS combines a classical periodicity detection algorithm with harmonic regression modeling and employs statistical hypothesis testing to assess the significance of each candidate period, moving beyond analyses grounded in explicit syntactic or discourse units. Empirical evaluation across multiple multilingual corpora demonstrates that human language exhibits robust, statistically significant periodicity in information, driven jointly by local syntactic constraints and longer-range semantic and rhetorical factors. APS reliably identifies implicit, multi-scale periods ranging from a few words to dozens of words and generalizes well across languages. The work provides a quantitative framework for modeling linguistic cognition and uncovering latent text structure.
📝 Abstract
Recent theoretical advances in the study of information density in natural language have raised the following question: to what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adapts a canonical periodicity detection algorithm to identify any significant periods present in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields the following findings: first, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; second, new periods lying outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structural factors and other driving factors that take effect over longer distances. We further discuss the advantages of our periodicity detection method and its potential for detecting LLM-generated text.
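The abstract names two ingredients: a periodicity detector applied to a document's surprisal sequence, and harmonic regression used to confirm candidate periods. The paper's exact algorithm is not given here, so the following is only a minimal sketch of that two-step idea under simplifying assumptions: it scores periods with a periodogram, uses a permutation-based significance threshold in place of the paper's hypothesis test, and confirms a period by the R² of a sine/cosine regression. The function names (`detect_periods`, `harmonic_r2`), the toy series, and all parameter choices are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def detect_periods(surprisal, n_perm=200, alpha=0.05, seed=0):
    """Sketch of step 1: keep every periodogram bin whose power exceeds
    a permutation null threshold (shuffling the sequence destroys any
    periodic structure while preserving its marginal distribution)."""
    x = np.asarray(surprisal, dtype=float)
    x = x - x.mean()
    n = len(x)
    freqs = np.fft.rfftfreq(n)                  # cycles per word
    power = np.abs(np.fft.rfft(x)) ** 2 / n
    rng = np.random.default_rng(seed)
    null_max = np.array([
        (np.abs(np.fft.rfft(rng.permutation(x))) ** 2 / n)[1:].max()
        for _ in range(n_perm)
    ])
    thresh = np.quantile(null_max, 1.0 - alpha)
    # Convert significant frequencies back to periods in words.
    return sorted(1.0 / f for f, p in zip(freqs[1:], power[1:]) if p > thresh)

def harmonic_r2(surprisal, period):
    """Sketch of step 2: confirm a candidate period by regressing the
    sequence on sine/cosine terms at that period; return the R^2."""
    x = np.asarray(surprisal, dtype=float)
    t = np.arange(len(x))
    X = np.column_stack([
        np.ones(len(x)),
        np.cos(2 * np.pi * t / period),
        np.sin(2 * np.pi * t / period),
    ])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    resid = x - X @ beta
    return 1.0 - resid.var() / x.var()

# Toy "surprisal" series with a planted 8-word period plus noise.
t = np.arange(256)
rng = np.random.default_rng(1)
toy = 1.5 * np.sin(2 * np.pi * t / 8) + rng.normal(0.0, 0.5, size=256)
periods = detect_periods(toy)
```

On the toy series, the detector recovers a period near 8 words, and `harmonic_r2(toy, 8.0)` is high because the planted sinusoid explains most of the variance; on real surprisal sequences, multiple significant periods at different scales could co-occur, which is the multi-scale structure the abstract describes.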