Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

📅 2026-01-21

📈 Citations: 0

✨ Influential: 0

career value

226K/year

🤖 AI Summary

This study identifies and formally defines the phenomenon of “contextual privacy collapse,” wherein benign fine-tuning can silently induce state-of-the-art language models to leak user information in tool-interaction or cross-session scenarios—without degrading standard safety or utility metrics. Through systematic experiments spanning six models (including both closed- and open-source variants), five fine-tuning datasets (comprising real-world and controlled data), and two task categories (agent-based and memory-related), the work reveals that fine-tuning disproportionately destabilizes privacy-sensitive representations while preserving task-relevant features. These findings demonstrate that current safety evaluation frameworks are ill-equipped to detect such vulnerabilities, posing significant risks for the deployment of specialized AI agents handling sensitive user contexts.

Technology Category

Application Category

📝 Abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure''because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

Problem

Research questions and friction points this paper is trying to address.

privacy collapse

contextual privacy

language models

fine-tuning

privacy vulnerability

Innovation

Methods, ideas, or system contributions that make the work stand out.

privacy collapse

contextual privacy

benign fine-tuning