RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient robustness of large language model (LLM) text detection under out-of-distribution (OOD) conditions, this paper proposes a lightweight, fine-tuning-free detection method grounded in the statistics of neural activation patterns. Rather than relying on output probabilities or external classifiers, the method uses a surrogate model to extract hidden-layer representations, computes projection scores along a discriminative direction, and models the intrinsic distributional differences between human-written and machine-generated text from raw activation statistics. Under stringent cross-model, cross-length, and adversarial-perturbation settings, its performance remains stable, and across multiple benchmarks it achieves an average AUROC of 94.92%, substantially outperforming existing state-of-the-art methods. The combination of high accuracy, strong generalization across diverse OOD scenarios, and computational efficiency makes the framework well suited to real-world deployment.

📝 Abstract
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinct activation feature that better identifies LGT. A text can then be classified by calculating the projection score of its representation along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average 94.92% AUROC in both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard
Problem

Research questions and friction points this paper is trying to address.

Detecting LLM-generated text to prevent misuse
Improving robustness in out-of-distribution scenarios
Identifying statistical differences between AI and human texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM internal representations for detection
Extracts distinct activation features from texts
Classifies texts by comparing projection scores against a precomputed threshold
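The projection-and-threshold step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple difference-of-means feature direction over representation vectors, whereas RepreGuard's actual feature extraction from surrogate-model hidden states may differ; all function names and the toy data are hypothetical.

```python
# Hypothetical sketch of projection-score classification over text
# representations (difference-of-means direction assumed for illustration).

def mean_vec(reps):
    """Component-wise mean of a list of representation vectors."""
    n = len(reps)
    return [sum(r[i] for r in reps) / n for i in range(len(reps[0]))]

def feature_direction(lgt_reps, hwt_reps):
    """Unit vector pointing from the mean HWT representation to the mean LGT one."""
    lm, hm = mean_vec(lgt_reps), mean_vec(hwt_reps)
    d = [a - b for a, b in zip(lm, hm)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

def projection_score(rep, direction):
    """Scalar projection of one text's representation onto the feature direction."""
    return sum(a * b for a, b in zip(rep, direction))

def calibrate_threshold(lgt_reps, hwt_reps, direction):
    """Midpoint between the two classes' mean projection scores."""
    lgt = [projection_score(r, direction) for r in lgt_reps]
    hwt = [projection_score(r, direction) for r in hwt_reps]
    return (sum(lgt) / len(lgt) + sum(hwt) / len(hwt)) / 2

def classify(rep, direction, threshold):
    return "LGT" if projection_score(rep, direction) > threshold else "HWT"

# Toy 2-d stand-ins for surrogate-model hidden representations
lgt = [[2.0, 0.5], [3.0, -0.5], [2.5, 0.0]]
hwt = [[-2.0, 0.5], [-3.0, -0.5], [-2.5, 0.0]]
d = feature_direction(lgt, hwt)          # ≈ [1.0, 0.0]
t = calibrate_threshold(lgt, hwt, d)     # ≈ 0.0
```

In this toy setup, a new representation such as `[2.2, 0.1]` projects well above the threshold and is labeled `LGT`; the precomputed `d` and `t` make inference a single dot product and comparison, which is what makes this family of detectors cheap at deployment time.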
Xin Chen
NLP2CT Lab, Department of Computer and Information Science, University of Macau
Junchao Wu
NLP2CT Lab, Department of Computer and Information Science, University of Macau
Shu Yang
Provable Responsible AI and Data Analytics Lab, KAUST
Runzhe Zhan
Ph.D. Candidate, University of Macau
Machine Translation, Language Models, Multilinguality
Zeyu Wu
NLP2CT Lab, Department of Computer and Information Science, University of Macau
Ziyang Luo
Salesforce AI Research
Agents, LLMs, Multimodal
Di Wang
Provable Responsible AI and Data Analytics Lab, KAUST
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding
Lidia S. Chao
University of Macau
Derek F. Wong
Professor, Department of Computer and Information Science, University of Macau
Machine Translation, Neural Machine Translation, Natural Language Processing, Machine Learning