🤖 AI Summary
This work addresses the limited generalization of current AI-generated text detectors in real-world scenarios, where it remains unclear whether models learn universal machine-authorship signatures or merely dataset-specific stylistic artifacts. To investigate this, the authors propose a detection framework integrating linguistic feature engineering, machine learning, and SHAP-based interpretability analysis. Their approach systematically reveals a fundamental contradiction: linguistic features that are effective within a domain often fail to generalize across domains. Comprehensive cross-generator and cross-domain evaluations demonstrate that prevailing methods rely heavily on dataset-specific cues rather than stable generative signals. While the model achieves a strong in-domain F1 score of 0.9734 on the PAN CLEF 2025 and COLING 2025 benchmarks, its performance degrades significantly under domain shift. The authors release an open-source Python toolkit supporting both prediction and instance-level explanations.
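The paper does not enumerate its 30 linguistic features, but the first stage of such a pipeline can be sketched with a few illustrative, stdlib-only features (average sentence length, type-token ratio, punctuation density); the feature names and choices below are assumptions, not the authors' actual feature set:

```python
import re

def extract_features(text):
    """Toy linguistic feature extractor.

    Illustrative only: the paper's 30 features are not listed in the
    abstract, so these three are stand-ins for that feature family.
    """
    # Sentence boundaries approximated by terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Word tokens: runs of letters (apostrophes allowed), lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / n_words,
        "punct_density": sum(c in ",;:" for c in text) / n_words,
    }

feats = extract_features(
    "The model writes fluently. It rarely errs; it repeats itself."
)
```

A downstream classifier would consume such feature vectors; the generalisation question the paper raises is precisely whether features like these carry a generator signal or a dataset signal.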
📝 Abstract
The widespread adoption of Large Language Models (LLMs) has made the detection of AI-generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. In-depth error analysis further exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the most susceptible to domain shift, formatting variation, and text-length effects. We believe this knowledge helps build AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.
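The SHAP attributions the abstract relies on have a closed form for linear models, which makes the idea easy to illustrate without the `shap` library: for f(x) = w·x + b, the exact SHAP value of feature i is φ_i = w_i·(x_i − E[x_i]). The weights, feature names, and background means below are invented for illustration, not the paper's fitted model:

```python
def linear_shap(weights, x, background_mean):
    """Exact SHAP values for a linear model: phi_i = w_i * (x_i - E[x_i]).

    A minimal stand-in for the paper's SHAP analysis. `weights` maps
    feature names to linear coefficients; `background_mean` holds the
    per-feature expectations E[x_i] over a background dataset.
    """
    return {
        name: w * (xi - mu)
        for (name, w), xi, mu in zip(weights.items(), x, background_mean)
    }

# Hypothetical coefficients and feature values for one text.
weights = {"avg_sentence_len": 0.8, "type_token_ratio": -1.2, "punct_density": 0.3}
phi = linear_shap(weights, x=[22.0, 0.55, 0.04], background_mean=[18.0, 0.60, 0.05])
```

Cross-dataset instability then shows up directly: refitting the model on another corpus yields different weights, and therefore a different ranking of |φ_i|, which is the pattern the paper reports.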