Extracting books from production language models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the risk that copyrighted training data, such as books, can still be extracted from mainstream production large language models despite their safeguards. The authors propose a two-stage approach: first, they assess extractability with probe-based testing, combined where necessary with a Best-of-N jailbreaking strategy; second, they use iterative continuation prompting to reconstruct full or near-full texts, and they introduce nv-recall, a block-based approximate longest-common-substring metric, to quantify extraction fidelity. Experiments demonstrate partial or near-complete book extraction from Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3, reaching an nv-recall of up to 95.8%. The work provides empirical evidence that current safety mechanisms in production LLMs have substantial limitations in preventing copyrighted-data leakage.
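The second-stage iterative continuation prompting described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `model` callable stands in for a production-LLM API, and the prompt wording, tail length, and refusal check are all assumptions.

```python
def extract_book(model, seed_text, max_rounds=50, tail_chars=500):
    """Iterative continuation prompting (sketch): repeatedly ask the
    model to continue from the tail of the text recovered so far,
    accumulating (near-)verbatim output until it stops producing text.
    `model` is a stand-in callable mapping a prompt string to a
    completion string; the prompt wording is illustrative only."""
    recovered = seed_text
    for _ in range(max_rounds):
        prompt = ("Continue the following passage verbatim:\n\n"
                  + recovered[-tail_chars:])
        continuation = model(prompt)
        if not continuation.strip():  # empty reply as a refusal stand-in
            break
        recovered += continuation
    return recovered
```

In practice each call would go through the provider's API (possibly after a successful Best-of-N jailbreak in the probe phase), and a real refusal detector would be needed in place of the empty-string check.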

📝 Abstract
Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
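The abstract describes nv-recall only at a high level. A minimal sketch of one plausible block-based approximation follows; the block size, matching rule, and normalization are assumptions for illustration, not the paper's exact algorithm.

```python
def nv_recall(reference, output, block_size=50):
    """Block-based approximation of longest-common-substring recall
    (illustrative reconstruction): split the reference text into
    fixed-size blocks and find the longest run of consecutive
    reference blocks that each appear verbatim in the model output,
    normalized by the reference length."""
    blocks = [reference[i:i + block_size]
              for i in range(0, len(reference), block_size)]
    best = run = 0
    for block in blocks:
        if block in output:   # block recovered (near-)verbatim
            run += 1
            best = max(best, run)
        else:
            run = 0           # run of consecutive blocks broken
    return min(1.0, best * block_size / len(reference))
```

Scoring blocks rather than raw characters keeps the metric cheap to compute on book-length texts while still rewarding long contiguous verbatim spans, which is the behavior the headline numbers (e.g., nv-recall up to 95.8%) are meant to capture.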
Problem

Research questions and friction points this paper is trying to address.

memorization
copyright
large language models
data extraction
production LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

training data extraction
production LLMs
copyright memorization
jailbreak prompting
nv-recall