🤖 AI Summary
This study systematically investigates the impact of differential privacy (DP) fine-tuning on the generative quality and downstream utility of large language models (LLMs). We evaluate five state-of-the-art LLMs across three diverse corpora under four privacy budgets (ε = 0.1–8.0), using linguistic quality metrics—including output length, grammatical correctness, and n-gram diversity—as well as two real-world downstream tasks: book genre classification and cause-of-death identification. Our key contribution is the first empirical demonstration that stringent DP constraints severely degrade generation quality: text length drops by ≥77%, grammatical correctness declines by >9%, and bigram diversity falls by >10%; downstream task accuracy also suffers substantial degradation. These findings reveal a critical gap: current DP fine-tuning methods fail to simultaneously ensure privacy protection, generative robustness, and task utility. Our results establish a foundational benchmark for the privacy–quality trade-off and identify concrete directions for improving DP-aware LLM adaptation.
📝 Abstract
Synthesizing data with large language models (LLMs) fine-tuned under differential privacy (DP) has recently become a popular way to protect user privacy. However, the impact of DP fine-tuning on the linguistic quality and utility of the texts LLMs produce has not been investigated. In this work, we fine-tune five LLMs on three corpora under four levels of privacy and assess the length, grammatical correctness, and lexical diversity of the texts they generate. We also probe the utility of the synthetic outputs in downstream classification tasks: book genre recognition from book descriptions and cause-of-death recognition from verbal autopsies. The results indicate that LLMs tuned under stronger privacy constraints produce texts that are at least 77% shorter, at least 9% less grammatically correct, and at least 10% less diverse in bigram diversity. Furthermore, their accuracy in downstream classification tasks decreases, which may limit the usefulness of the generated synthetic data.
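The bigram-diversity metric used above can be illustrated with a minimal sketch. The paper does not publish its exact formula; a common definition (distinct-n) is the ratio of unique bigrams to total bigrams in a generated text, so repetitive outputs score lower:

```python
def bigram_diversity(text: str) -> float:
    """Return the fraction of distinct bigrams among all token bigrams.

    Note: whitespace tokenization is an assumption for illustration;
    the study's actual tokenizer is not specified here.
    """
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)

# Repetitive text yields a low score; fully varied text scores 1.0:
print(bigram_diversity("the cat sat the cat sat the cat sat"))      # 0.375
print(bigram_diversity("a quick brown fox jumps over the lazy dog"))  # 1.0
```

A >10% drop in this ratio, as reported for strongly private models, means the generated text reuses noticeably more word pairs.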