Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

📅 2026-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study demonstrates that instruction fine-tuning and semantic prompting can bypass alignment safeguards to activate latent memorized copyrighted content in large language models, challenging the legal assumption that models do not retain training data. It systematically shows that mainstream models—including GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1—can reproduce held-out copyrighted books with high fidelity, generating verbatim excerpts exceeding 460 consecutive words in a single output. Reproduction rates reach 85–90% and generalize across more than 30 unrelated authors. Strikingly, different models exhibit highly consistent memory-activation patterns for the same text (r ≥ 0.90), revealing a shared industry-wide vulnerability in model alignment and data retention.

📝 Abstract
Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
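The abstract's headline numbers (verbatim spans exceeding 460 words) depend on measuring the longest contiguous word-level overlap between a model's output and the source book. The paper's actual evaluation code is not shown here; the following is a minimal sketch of one standard way to compute such a span, using a dynamic-programming longest-common-substring over word tokens (all names are hypothetical):

```python
def longest_verbatim_span(book_words, output_words):
    """Length, in words, of the longest contiguous run of words that
    appears in both the book text and a model's output.

    A span of e.g. 460+ words returned here would correspond to the
    kind of verbatim reproduction the paper reports.
    """
    best = 0
    # Classic O(n*m) longest-common-substring DP, keeping only the
    # previous row to stay memory-light for long books.
    prev = [0] * (len(output_words) + 1)
    for bw in book_words:
        curr = [0] * (len(output_words) + 1)
        for j, ow in enumerate(output_words, 1):
            if bw == ow:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


book = "it was a bright cold day in april".split()
out = "the model wrote it was a bright cold day again".split()
print(longest_verbatim_span(book, out))  # longest shared run: "it was a bright cold day"
```

In practice such a metric would be run over normalized tokens (lowercased, punctuation-stripped) so that trivial formatting differences do not break a span; the sketch above compares raw words for clarity.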
Problem

Research questions and friction points this paper is trying to address.

copyright infringement
verbatim recall
alignment bypass
large language models
memorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

finetuning-induced memorization
verbatim recall
copyright vulnerability
alignment bypass
latent memorization