🤖 AI Summary
This work addresses controllable narrative generation under an extreme constraint: 90% of the output must consist of verbatim human-written text (the "Frankentext" task), while preserving long-range logical coherence and prompt fidelity. We propose a two-stage method powered by Gemini-2.5-Pro: (1) a drafting stage that selects and concatenates heterogeneous human-authored text fragments; and (2) a revision stage that iteratively refines semantic coherence while enforcing the target copy ratio. In experiments, 81% of outputs are judged logically coherent and 100% adhere to the prompt specification. We formally define the Frankentext task for the first time and empirically reveal a critical limitation of mainstream AI detectors: up to 59% of AI-generated Frankentexts are misclassified as human-written. Our contributions include a novel benchmark, a curated dataset, and a methodology enabling rigorous evaluation of mixed-authorship attribution and human-AI collaborative writing systems.
📝 Abstract
We introduce Frankentexts, a new type of long-form narrative produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% are relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed-authorship detection, and serve as a sandbox for studying human-AI co-writing processes.
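The revision stage hinges on measuring how much of a draft remains verbatim human text. The paper does not specify its exact metric here, but as a rough illustration one can estimate a copy ratio as the fraction of generated tokens covered by n-grams that appear verbatim in the source passages (a minimal sketch; the n-gram length and whitespace tokenization are assumptions, not the authors' definition):

```python
def copy_ratio(generated: str, sources: list[str], n: int = 5) -> float:
    """Fraction of tokens in `generated` covered by n-grams that occur
    verbatim in any source passage (illustrative, not the paper's metric)."""
    gen_tokens = generated.split()
    if len(gen_tokens) < n:
        return 0.0
    # Collect every n-gram present in the human-written source passages.
    source_ngrams = set()
    for passage in sources:
        toks = passage.split()
        for i in range(len(toks) - n + 1):
            source_ngrams.add(tuple(toks[i : i + n]))
    # Mark each generated token that falls inside a matching n-gram span.
    copied = [False] * len(gen_tokens)
    for i in range(len(gen_tokens) - n + 1):
        if tuple(gen_tokens[i : i + n]) in source_ngrams:
            for j in range(i, i + n):
                copied[j] = True
    return sum(copied) / len(gen_tokens)
```

A revision loop could then reject or re-edit any draft whose `copy_ratio` drops below the user-specified target (e.g., 0.9), keeping edits confined to the remaining AI-authored tokens.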