DMAP: A Distribution Map for Text

📅 2026-02-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing text analysis methods, such as perplexity, overlook the influence of context on the shape of the next-token probability distribution, limiting their ability to capture the statistical characteristics of language model–generated text. This work proposes DMAP, a mathematically rigorous distribution-mapping mechanism that transforms text, via a language model, into a set of samples in the unit interval. By jointly encoding token rank and probability information, DMAP enables efficient, model-agnostic statistical analysis and provides a unified representation of both the probabilistic and ranking structure of text. Its effectiveness is demonstrated across three case studies: accurately inferring generation parameters, revealing the critical role of probability curvature in machine-generated text detection, and tracing statistically detectable signatures of synthetic data in downstream models.


๐Ÿ“ Abstract
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
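The abstract describes mapping a text, via a language model's next-token distributions, to samples in the unit interval that jointly encode rank and probability. The paper's exact construction is not reproduced here; the sketch below is one plausible reading, assuming a randomized probability-integral-transform style mapping: each observed token is sent to a point drawn uniformly from the sub-interval of [0, 1] that its probability mass occupies when tokens are ordered by rank. The function name `dmap_sample` and the toy distribution are illustrative, not from the paper.

```python
# Hypothetical DMAP-style mapping (an assumption, not the paper's code):
# send each observed token to a point in [0, 1] that encodes both its
# rank and its probability under the model's conditional distribution.
import numpy as np

def dmap_sample(probs, token_id, rng):
    """Map one observed token to a point in the unit interval.

    probs: next-token distribution (1-D array summing to 1).
    token_id: index of the token actually observed.
    rng: a numpy random Generator.
    """
    p_tok = probs[token_id]
    # Total mass of tokens ranked strictly above the observed one
    # (ties broken by index so the sub-intervals partition [0, 1]).
    higher = (probs > p_tok) | (
        (probs == p_tok) & (np.arange(len(probs)) < token_id)
    )
    lo = probs[higher].sum()
    # Randomize within the token's own sub-interval: if the text really
    # were sampled from this distribution, the result is uniform on [0, 1].
    return lo + rng.uniform() * p_tok

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])  # toy conditional distribution
u = [dmap_sample(probs, t, rng) for t in range(4)]
print(u)  # four values in the unit interval
```

Under this construction, the rank of the observed token determines which sub-interval of [0, 1] the sample lands in, while its probability determines that sub-interval's width, which is consistent with the abstract's claim that the representation jointly encodes rank and probability and supports model-agnostic downstream statistics (e.g. uniformity tests).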
Problem

Research questions and friction points this paper is trying to address.

next-token probability
conditional distribution
text analysis
large language models
statistical signal
Innovation

Methods, ideas, or system contributions that make the work stand out.

DMAP
distribution mapping
next-token probability
model-agnostic analysis
synthetic data forensics