🤖 AI Summary
To address the insufficient robustness of large language model (LLM) text detection under out-of-distribution (OOD) conditions, this paper proposes a lightweight, fine-tuning-free detection method grounded in the statistical characteristics of neural activation patterns. The method bypasses output probabilities and external classifiers; instead, it uses a surrogate model to extract hidden-layer representations, computes projection scores along a discriminative direction, and models the intrinsic distributional disparities between human- and machine-generated text via raw activation statistics. Evaluated under stringent cross-model, cross-length, and adversarial-perturbation settings, the approach remains consistently stable. On multiple benchmarks, it achieves an average AUROC of 94.92%, substantially outperforming existing state-of-the-art methods. The framework combines high accuracy, strong generalization across diverse OOD scenarios, and computational efficiency, making it well suited for real-world deployment.
📝 Abstract
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinctive activation feature that better identifies LGT. A text can then be classified by computing the projection score of its representation along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average AUROC of 94.92% in both in-distribution (ID) and OOD scenarios, while also demonstrating strong resilience to various text lengths and mainstream attacks. Data and code are publicly available at: https://github.com/NLP2CT/RepreGuard
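The projection-and-threshold step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hidden-layer representations are stubbed with synthetic vectors, and the discriminative direction is taken as the simple difference of class means (the paper's exact feature-extraction procedure and threshold calibration may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for surrogate-model hidden states: one mean-pooled
# activation vector per text (dim = 8 here; real hidden sizes are larger).
hwt = rng.normal(0.0, 1.0, size=(50, 8))  # human-written texts
lgt = rng.normal(1.0, 1.0, size=(50, 8))  # LLM-generated texts

# A simple discriminative direction: difference of class means,
# normalized to unit length (an assumption for this sketch).
direction = lgt.mean(axis=0) - hwt.mean(axis=0)
direction /= np.linalg.norm(direction)

def projection_score(rep: np.ndarray) -> float:
    """Project a text's representation onto the feature direction."""
    return float(rep @ direction)

# Precompute a threshold, e.g. the midpoint between the two class-mean scores.
threshold = 0.5 * (projection_score(lgt.mean(axis=0))
                   + projection_score(hwt.mean(axis=0)))

def is_llm_generated(rep: np.ndarray) -> bool:
    """Classify: score above the threshold is flagged as LLM-generated."""
    return projection_score(rep) > threshold
```

Because the classifier is just a dot product plus a comparison, inference adds negligible cost on top of one surrogate-model forward pass, which is consistent with the efficiency claim above.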