🤖 AI Summary
This work identifies a key limitation of text detectors that average token-level likelihood scores: the score distributions differ across regions of the detector model's hidden space, so naive averaging produces a form of Simpson's paradox in which strong local signals are washed out, impairing discrimination between human-written and LLM-generated text. To address this, the authors propose a local calibration step grounded in Bayesian decision theory: lightweight predictors learn the score distributions conditioned on position in hidden space, and calibrated log-likelihood ratios are aggregated in place of raw scores. The method plugs into any token-averaging detection pipeline and improves every baseline detector on every dataset considered. For instance, it raises Fast-DetectGPT's AUROC on GPT-5.4 text from 0.63 to 0.85, and a locally calibrated DMAP detector introduced in the paper achieves state-of-the-art performance.
📝 Abstract
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
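The aggregation failure described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' method: it assumes two discrete hidden-space "regions" with opposite score shifts and fits a Gaussian to each (region, label) cell, whereas the paper learns lightweight predictors conditioned on position in hidden space. All names (`SCORE_MEAN`, `sample_doc`, `calibrated_score`, and so on) and all numeric settings are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import fmean, pstdev

# Toy assumption: two hidden-space "regions" whose machine-vs-human score
# shifts point in opposite directions. Values are means of unit-variance
# Gaussian token scores and are invented for this illustration.
SCORE_MEAN = {(0, "human"): 0.0, (0, "machine"): 1.0,
              (1, "human"): 0.0, (1, "machine"): -1.0}

def sample_doc(label, rng, n_tokens=50):
    """Return a list of (region, score) pairs for one synthetic document."""
    doc = []
    for _ in range(n_tokens):
        region = rng.randrange(2)
        doc.append((region, rng.gauss(SCORE_MEAN[(region, label)], 1.0)))
    return doc

def fit_gaussians(docs_by_label):
    """Fit (mean, std) of the token scores in each (region, label) cell."""
    params = {}
    for label, docs in docs_by_label.items():
        for region in (0, 1):
            scores = [s for doc in docs for r, s in doc if r == region]
            params[(region, label)] = (fmean(scores), pstdev(scores))
    return params

def log_normal_pdf(x, mean, std):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def raw_score(doc):
    """Naive detector statistic: average the raw token scores."""
    return fmean(s for _, s in doc)

def calibrated_score(doc, params):
    """Average the per-token log-likelihood ratio, conditioned on region."""
    return fmean(
        log_normal_pdf(s, *params[(r, "machine")])
        - log_normal_pdf(s, *params[(r, "human")])
        for r, s in doc)

def auroc(pos, neg):
    """Probability that a positive outranks a negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
labels = ("human", "machine")
train = {lab: [sample_doc(lab, rng) for _ in range(200)] for lab in labels}
test = {lab: [sample_doc(lab, rng) for _ in range(200)] for lab in labels}
params = fit_gaussians(train)

auroc_raw = auroc([raw_score(d) for d in test["machine"]],
                  [raw_score(d) for d in test["human"]])
auroc_cal = auroc([calibrated_score(d, params) for d in test["machine"]],
                  [calibrated_score(d, params) for d in test["human"]])
print(f"raw mean AUROC: {auroc_raw:.2f}, calibrated LLR AUROC: {auroc_cal:.2f}")
```

In this toy setting the opposite shifts cancel under raw averaging, so the raw statistic separates the classes no better than chance, while the region-conditioned log-likelihood ratio recovers the local signal before aggregation. The exact numbers depend on the invented parameters and say nothing about the paper's reported results.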