🤖 AI Summary
Language models often reproduce segments of their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. ParaPO (Paraphrase Preference Optimization) is a post-training method that fine-tunes LMs to prefer paraphrased versions of memorized segments over the original verbatim content, reducing unintentional regurgitation while preserving overall utility. A system-prompt variant makes this behavior controllable: the model suppresses unwanted verbatim copying when asked to, while retaining the ability to recall famous quotations when appropriate. On Llama3.1-8B, ParaPO reduces regurgitation across all tested datasets (e.g., from 17.3 to 12.9 on a creative-writing metric), whereas unlearning methods from prior work are less effective outside their targeted unlearned domain (from 17.3 to 16.9). Applied to the instruction-tuned Tulu3-8B, ParaPO with system prompting preserves famous-quotation recall while cutting regurgitation from 8.7 to 6.3 when prompted not to regurgitate; without ParaPO tuning, the same prompt yields only a marginal reduction (8.7 to 8.4).
📝 Abstract
Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).
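The abstract does not spell out the training pipeline, but the core idea (prefer a paraphrase of a memorized continuation over the verbatim one) maps naturally onto standard preference-pair construction plus a DPO-style objective. The sketch below is a minimal, hypothetical illustration of that reading: `build_parapo_pairs`, the `paraphrase` callable, the optional system prompt string, and the data layout are all assumptions, not the paper's actual code.

```python
import math

def build_parapo_pairs(segments, paraphrase, prefix_len=32, system_prompt=None):
    """Turn memorized pretraining segments into preference pairs.

    For each segment, the opening characters serve as the prompt; the
    verbatim continuation is the dispreferred ("rejected") completion and
    its paraphrase is the preferred ("chosen") one. An optional system
    prompt models the controllable variant described in the abstract.
    All names and the character-level split are illustrative assumptions.
    """
    pairs = []
    for seg in segments:
        prompt, verbatim = seg[:prefix_len], seg[prefix_len:]
        pairs.append({
            "system": system_prompt,          # e.g. "Do not regurgitate."
            "prompt": prompt,
            "chosen": paraphrase(verbatim),   # preferred: paraphrased text
            "rejected": verbatim,             # dispreferred: verbatim copy
        })
    return pairs

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on one pair: -log sigmoid(beta * margin),
    where the margin is the policy-vs-reference log-prob gap between the
    chosen (paraphrase) and rejected (verbatim) completions."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

In this reading, training lowers the loss by pushing probability mass toward paraphrases and away from verbatim continuations, which is consistent with the reported drop in the regurgitation metric without a full utility collapse.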