AI Summary
This study addresses the challenge of attributing AI-generated text, particularly output from large language models (LLMs), to its source model and of distinguishing it from human authorship. Methodologically, we construct a benchmark dataset encompassing outputs from multiple LLMs (including GPT-4) and human-written texts; extract lexical, syntactic, and punctuation features using StyloMetrix and a custom n-gram pipeline; and employ LightGBM and decision trees for classification, augmented with SHAP for feature-level interpretability. Our key contribution is the first systematic identification of cross-model stable patterns in LLM text, specifically heightened grammatical standardisation and distinctive lexical distributions, enabling robust multi-class discrimination and fine-grained attribution. Experiments achieve a Matthews correlation coefficient of 0.87 on a 7-class classification task, binary classification accuracy ranging from 0.79 to 1.0, and 0.98 accuracy on Wikipedia vs. GPT-4 discrimination, demonstrating high accuracy, cross-dataset robustness, and transparent, interpretable decision-making.
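The n-gram feature extraction mentioned above can be illustrated with a minimal sketch. This is not the paper's actual StyloMetrix/LightGBM pipeline; the helper names and the choice of character-level (rather than word-level) n-grams are illustrative assumptions:

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Count overlapping character n-grams (hypothetical helper,
    # standing in for the paper's custom n-gram pipeline).
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_profile(text, n=3, top_k=5):
    # A crude stylometric fingerprint: the top_k most frequent n-grams.
    counts = char_ngrams(text, n)
    return [gram for gram, _ in counts.most_common(top_k)]

profile = ngram_profile("the quick brown fox jumps over the lazy dog", n=2)
```

In the paper's setup, vectors of such frequencies (alongside StyloMetrix's hand-designed lexical, grammatical, syntactic, and punctuation features) would be fed to tree-based classifiers.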
Abstract
The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) texts processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) texts processed through rephrasing methods (Dipper, T5). The 10-sentence-long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLM-generated texts with respect to human-written ones. Crucially, in the context of increasingly sophisticated LLMs, these results show that it is possible to distinguish machine- from human-generated texts, at least for a well-defined text type.
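The Matthews correlation coefficient reported above generalises to the multiclass case, but its binary form is the simplest to state. A minimal sketch from the confusion-matrix counts (a standard formula, not code from the paper):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient for a binary confusion matrix:
    # +1 = perfect prediction, 0 = chance level, -1 = total disagreement.
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# A perfect classifier on a balanced set of 100 texts scores 1.0.
print(mcc(tp=50, tn=50, fp=0, fn=0))  # → 1.0
```

Unlike plain accuracy, MCC stays informative under class imbalance, which is why it is a common choice for multiclass attribution tasks like the 7-class scenario here.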