AI Summary
This study investigates how text generated by large language models (LLMs) alters the structural statistical properties of language and proposes a general-purpose detection method that does not rely on model internals or semantic evaluation. Using lossless compression ratio as a model-agnostic metric, the authors systematically compare the statistical regularities of human- and LLM-generated text across three scenarios: controlled continuation, knowledge mediation (Wikipedia vs. Grokipedia), and synthetic social environments (Moltbook vs. Reddit). The findings reveal that LLM-generated text is generally more structured and compressible than human text, though this distinction diminishes at smaller scales in fragmented interactive settings, highlighting fundamental limits to surface-level distinguishability. The approach demonstrates robustness across diverse models, tasks, and domains.
Abstract
Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
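The paper's pipeline is not reproduced here, but the core measure it describes, a lossless compression ratio over surface text, can be sketched in a few lines. The sketch below uses Python's standard `zlib` compressor; the choice of compressor, compression level, and the example strings are illustrative assumptions, not the authors' exact setup. A lower ratio means the text contains more recurrent statistical structure, which is the signature the abstract attributes to LLM-generated language.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size for a UTF-8 encoded string.

    Lower values indicate more statistical regularity (the text is
    more compressible); values near 1 indicate little exploitable
    structure. Compressor choice (zlib, level 9) is an assumption.
    """
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

# Hypothetical examples: a highly repetitive string compresses far
# better than ordinary varied prose of comparable length.
repetitive = "the cat sat on the mat. " * 100
prose = (
    "Large language models sample tokens from learned distributions, "
    "and the resulting text can differ in subtle statistical ways "
    "from text written by people, even when it reads fluently. "
) * 12

print(compression_ratio(repetitive))  # well below the prose ratio
print(compression_ratio(prose))
```

Comparing such ratios between corpora, rather than inspecting model internals or judging semantics, is what makes the measure model-agnostic: any text source can be scored the same way.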