🤖 AI Summary
To address the insufficient robustness of large language model (LLM) text detection under out-of-distribution (OOD) conditions, this paper proposes a lightweight, fine-tuning-free detection method grounded in the statistical characteristics of neural activation patterns. The method bypasses output probabilities and external classifiers; instead, it uses a surrogate model to extract hidden-layer representations, computes projection scores along a discriminative direction, and models the intrinsic distributional disparities between human- and machine-generated text via raw activation statistics. Evaluated under stringent cross-model, cross-length, and adversarial-perturbation settings, the approach remains consistently stable. On multiple benchmarks, it achieves an average AUROC of 94.92%, substantially outperforming existing state-of-the-art methods. The framework combines high accuracy, strong generalization across diverse OOD scenarios, and computational efficiency, making it well suited for real-world deployment.
📝 Abstract
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinctive activation feature that better identifies LGT. A text can then be classified by computing the projection score of its representation along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average AUROC of 94.92% in both in-distribution (ID) and OOD scenarios, while also demonstrating strong resilience to various text lengths and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard
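The projection-and-threshold step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hidden-layer representations are stubbed with synthetic vectors, and the discriminative direction is taken as the simple difference of class means (the paper's exact feature-extraction procedure and threshold calibration may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for surrogate-model hidden states: one mean-pooled
# activation vector per text (dim = 8 here; real hidden sizes are larger).
hwt = rng.normal(0.0, 1.0, size=(50, 8))  # human-written texts
lgt = rng.normal(1.0, 1.0, size=(50, 8))  # LLM-generated texts

# A simple discriminative direction: difference of class means,
# normalized to unit length (an assumption for this sketch).
direction = lgt.mean(axis=0) - hwt.mean(axis=0)
direction /= np.linalg.norm(direction)

def projection_score(rep: np.ndarray) -> float:
    """Project a text's representation onto the feature direction."""
    return float(rep @ direction)

# Precompute a threshold, e.g. the midpoint between the two class-mean scores.
threshold = 0.5 * (projection_score(lgt.mean(axis=0))
                   + projection_score(hwt.mean(axis=0)))

def is_llm_generated(rep: np.ndarray) -> bool:
    """Classify: score above the threshold is flagged as LLM-generated."""
    return projection_score(rep) > threshold
```

Because the classifier is just a dot product plus a comparison, inference adds negligible cost on top of one surrogate-model forward pass, which is consistent with the efficiency claim above.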