🤖 AI Summary
This work addresses the automatic detection and interpretability of text generated by large language models (LLMs), systematically investigating both binary (human vs. LLM) and ternary (adding an "Undecided" class) classification tasks. We propose the first evaluation framework for assessing LLMs' self-detection capability, benchmarking six mainstream open- and closed-source models. Results show that self-detection significantly outperforms cross-model detection, yet average accuracy remains only ~68%. Introducing the "Undecided" class significantly improves both classification accuracy and explanation quality (p < 0.01). We curate a high-quality human-annotated dataset and, for the first time, identify and categorize three major types of explanation errors, the most prominent being reliance on spurious features. Our findings establish a new, interpretable, and robust paradigm for governing LLM-generated content.
📝 Abstract
Large language models (LLMs) have demonstrated impressive capabilities in generating human-like texts, but the potential misuse of such LLM-generated texts raises the need to distinguish between human-generated and LLM-generated content. This paper explores the detection and explanation capabilities of LLM-based detectors of LLM-generated texts, in the context of a binary classification task (human-generated texts vs. LLM-generated texts) and a ternary classification task (human-generated texts, LLM-generated texts, and undecided). By evaluating six closed/open-source LLMs of different sizes, our findings reveal that while self-detection consistently outperforms cross-detection, i.e., LLMs can detect texts generated by themselves more accurately than those generated by other LLMs, the performance of self-detection is still far from ideal, indicating that further improvements are needed. We also show that extending the binary to the ternary classification task with a new class "Undecided" can enhance both detection accuracy and explanation quality, with improvements being statistically significant and consistent across all LLMs. Finally, we conduct comprehensive qualitative and quantitative analyses of the explanation errors, which are categorized into three types: reliance on inaccurate features (the most frequent error), hallucinations, and incorrect reasoning. These findings, together with our human-annotated dataset, emphasize the need for further research into improving both self-detection and self-explanation, particularly to address overfitting issues that may hinder generalization.
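To make the self- vs. cross-detection comparison concrete, here is a minimal sketch of how per-detector accuracies could be tallied under the binary and ternary label sets. This is not the paper's code; all record fields, model names, and label strings are hypothetical, and each record represents one detector's verdict on a text whose true origin is known.

```python
from collections import defaultdict

def detection_accuracy(records, labels):
    """Accuracy per (detector, mode), where mode is "self" when the
    detector judges text produced by itself, else "cross".
    Predictions outside the allowed label set are skipped."""
    tally = defaultdict(lambda: [0, 0])  # (detector, mode) -> [correct, total]
    for r in records:
        if r["prediction"] not in labels:
            continue
        mode = "self" if r["detector"] == r["source"] else "cross"
        tally[(r["detector"], mode)][1] += 1
        tally[(r["detector"], mode)][0] += int(r["prediction"] == r["gold"])
    return {k: c / t for k, (c, t) in tally.items() if t}

# Hypothetical label sets mirroring the paper's two task settings.
BINARY = {"human", "llm"}
TERNARY = BINARY | {"undecided"}

# Toy records: detector "A" judging its own output and others'.
records = [
    {"detector": "A", "source": "A",     "gold": "llm",   "prediction": "llm"},
    {"detector": "A", "source": "B",     "gold": "llm",   "prediction": "human"},
    {"detector": "A", "source": "human", "gold": "human", "prediction": "human"},
]
print(detection_accuracy(records, TERNARY))
```

In this toy run, detector "A" scores 1.0 on self-detection but only 0.5 on cross-detection, illustrating the gap the paper reports at scale; the ternary setting simply widens the allowed label set so an "undecided" verdict counts as a valid (and gradeable) answer rather than a formatting failure.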