🤖 AI Summary
This study investigates how well humans can detect text generated by commercial large language models (LLMs). Annotators were recruited to perform human–machine discrimination on 300 nonfiction English articles, providing rationales for their judgments. The methodology combines majority voting, qualitative analysis, and adversarial testing (including human editing and paraphrasing) to assess robustness. Results show that annotators who frequently use LLMs detect AI-generated text reliably without specialized training: a majority vote among five such annotators misclassifies only 1 of the 300 articles (99.7% accuracy), substantially outperforming state-of-the-art automated detectors. Human judgments rely on both lexical features and deeper semantic cues, such as excessive formality, low originality, and logical redundancy, revealing discriminative dimensions embedded in human intuition that current detectors do not model. To support reproducibility and further research, the authors publicly release a high-quality, human-annotated dataset and associated analysis code, establishing a new benchmark for AI-generated text detection and human–AI collaborative evaluation.
📝 Abstract
In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated, even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ("AI vocabulary"), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging for automatic detectors to assess. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.
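As an illustrative sketch only (not the authors' released code), the five-annotator majority-vote aggregation described above amounts to taking the most common label per article; the label names and votes here are hypothetical:

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one article's annotator labels ('human' or 'ai') by majority vote."""
    # With an odd number of annotators (e.g., five), a tie cannot occur.
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: five expert annotators judging a single article.
votes = ["ai", "ai", "human", "ai", "ai"]
print(majority_vote(votes))  # prints "ai"
```

Under this scheme, a single annotator's mistake on an article is masked as long as at least three of the five judge it correctly, which is consistent with the aggregate misclassifying only 1 of 300 articles.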