🤖 AI Summary
This work addresses authorship attribution for long-form texts generated by large language models (LLMs) in out-of-distribution (OOD) scenarios, such as cross-domain settings or attribution to an unseen target model. To tackle this problem, the authors propose TRACE, a lightweight and interpretable fingerprinting method that builds textual fingerprints from token-level transition patterns (e.g., word-rank statistics) estimated by a compact language model. They also introduce GhostWriteBench, the first book-scale benchmark for LLM-generated text attribution, comprising documents exceeding 50,000 words. Experiments show that TRACE achieves state-of-the-art performance on both closed-source and open-source LLMs and remains accurate and robust even under data-scarce and OOD conditions.
📝 Abstract
In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain shift and unseen LLM authors. We also propose TRACE, a novel fingerprinting method that is interpretable, lightweight, and applicable to both open- and closed-source models. TRACE builds its fingerprint from token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well with limited training data.
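To make the fingerprinting idea concrete, here is a minimal, hedged sketch of the kind of pipeline the abstract describes: score each token's rank, bin consecutive ranks into a transition histogram, and attribute a document to the model with the nearest fingerprint. This is *not* the authors' implementation: the bucket boundaries, the L1 distance, and the use of simple corpus-frequency ranks as a stand-in for ranks under a lightweight language model are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical coarse rank bins (top-1 / top-10 / top-100 / rest);
# the paper's actual binning, if any, is not specified here.
def rank_bucket(rank):
    if rank == 0:
        return 0
    if rank < 10:
        return 1
    if rank < 100:
        return 2
    return 3

def fingerprint(tokens, freq_rank):
    """Normalized histogram of transitions between consecutive tokens'
    rank buckets. freq_rank maps token -> rank; here a corpus-frequency
    rank stands in for the rank assigned by a small language model."""
    buckets = [rank_bucket(freq_rank.get(t, 10**6)) for t in tokens]
    trans = Counter(zip(buckets, buckets[1:]))
    total = sum(trans.values()) or 1
    return {pair: count / total for pair, count in trans.items()}

def l1(fp_a, fp_b):
    # L1 distance between two sparse fingerprint histograms.
    keys = set(fp_a) | set(fp_b)
    return sum(abs(fp_a.get(k, 0.0) - fp_b.get(k, 0.0)) for k in keys)

def attribute(tokens, freq_rank, model_fps):
    """Attribute a document to the candidate model whose reference
    fingerprint is nearest (nearest-neighbor over fingerprints)."""
    fp = fingerprint(tokens, freq_rank)
    return min(model_fps, key=lambda m: l1(fp, model_fps[m]))
```

In practice one would replace `freq_rank` with per-token ranks produced by an actual compact language model over its vocabulary at each position; the transition-histogram fingerprint itself stays cheap to compute and easy to inspect, which matches the lightweight-and-interpretable framing in the abstract.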