SynBench: A Benchmark for Differentially Private Text Generation

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In high-stakes domains such as healthcare and finance, sharing sensitive textual data is hindered by privacy leakage risks and the absence of a trusted, standardized benchmark for evaluating differentially private (DP) text generation. Method: We propose the first comprehensive evaluation framework for DP text synthesis, encompassing nine domain-specific datasets; introduce a novel membership inference attack tailored to synthetic text to quantify privacy leakage; and design a dual-dimensional metric jointly assessing utility (e.g., domain fidelity, semantic coherence) and privacy (ε-DP guarantees). Results: Experiments reveal that existing DP text generation methods suffer substantial utility degradation in complex domains and exhibit severe privacy–utility trade-off imbalances, underscoring the necessity of domain-aware evaluation. This work establishes a standardized benchmark, introduces a new attack paradigm, and provides foundational empirical insights for advancing DP text generation.

📝 Abstract
Data-driven decision support in high-stakes domains like healthcare and finance faces significant barriers to data sharing due to regulatory, institutional, and privacy concerns. While recent generative AI models, such as large language models, have shown impressive performance in open-domain tasks, their adoption in sensitive environments remains limited by unpredictable behaviors and insufficient privacy-preserving datasets for benchmarking. Existing anonymization methods are often inadequate, especially for unstructured text, as redaction and masking can still allow re-identification. Differential Privacy (DP) offers a principled alternative, enabling the generation of synthetic data with formal privacy assurances. In this work, we address these challenges through three key contributions. First, we introduce a comprehensive evaluation framework with standardized utility and fidelity metrics, encompassing nine curated datasets that capture domain-specific complexities such as technical jargon, long-context dependencies, and specialized document structures. Second, we conduct a large-scale empirical study benchmarking state-of-the-art DP text generation methods and LLMs of varying sizes under different fine-tuning strategies, revealing that high-quality domain-specific synthetic data generation under DP constraints remains an unsolved challenge, with performance degrading as domain complexity increases. Third, we develop a membership inference attack (MIA) methodology tailored for synthetic text, providing the first empirical evidence that the use of public datasets (potentially present in pre-training corpora) can invalidate claimed privacy guarantees. Our findings underscore the urgent need for rigorous privacy auditing and highlight persistent gaps between open-domain and specialist evaluations, informing responsible deployment of generative AI in privacy-sensitive, high-stakes settings.
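The abstract's notion of "formal privacy assurances" can be made concrete with the classic Laplace mechanism: a numeric statistic is released with noise whose scale is tied to the privacy budget ε. This is a minimal illustrative sketch of ε-DP in general, not the paper's text-generation pipeline; the function name and parameters are my own.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with epsilon-differential privacy via the Laplace mechanism.

    Noise scale b = sensitivity / epsilon: a larger epsilon means less noise
    and therefore a weaker privacy guarantee.
    """
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

# Example: privately release a count query (sensitivity 1) at epsilon = 0.5.
rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(100.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

DP text generators apply the same budget accounting at training time (e.g. noised gradients), but the utility cost the paper measures stems from this same noise-for-privacy exchange.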
Problem

Research questions and friction points this paper is trying to address.

Evaluating differential privacy methods for synthetic text generation
Benchmarking privacy-utility tradeoffs in domain-specific datasets
Assessing privacy risks from pretraining data contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential Privacy for synthetic text generation
Comprehensive evaluation framework with standardized metrics
Membership inference attack methodology for synthetic text
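The paper's MIA methodology for synthetic text is not reproduced here, but the general shape of a membership inference attack can be sketched with the standard loss-thresholding baseline: examples the model assigns unusually low loss are flagged as likely training members. All names and the loss values below are illustrative, not taken from the paper.

```python
def loss_threshold_mia(member_losses, nonmember_losses, threshold):
    """Toy loss-threshold membership inference attack.

    Flags an example as a training-set 'member' when its model loss falls
    below `threshold` (low loss suggests memorization). Returns the attack's
    (true-positive rate, false-positive rate).
    """
    tpr = sum(l < threshold for l in member_losses) / len(member_losses)
    fpr = sum(l < threshold for l in nonmember_losses) / len(nonmember_losses)
    return tpr, fpr

# Hypothetical per-example language-model losses (lower = more suspicious).
tpr, fpr = loss_threshold_mia([0.4, 0.6, 2.1], [1.8, 2.3, 2.7], threshold=1.0)
```

An attack like this succeeding against "private" synthetic text is exactly the kind of evidence the paper uses to show that claimed ε-DP guarantees can be hollow when the training data overlaps with public pre-training corpora.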
Yidan Sun
Imperial College London, Imperial Global Singapore
Viktor Schlegel
Deputy Director IN-CYPHER Programme @ IGS, Imperial College London
Natural Language Understanding · AI for Healthcare · Clinical NLP · AI Evaluation
Srinivasan Nandakumar
Imperial College London, Imperial Global Singapore
Iqra Zahid
Imperial College London, Imperial Global Singapore
Yuping Wu
University of Manchester, United Kingdom
Yulong Wu
University of Manchester, United Kingdom
Hao Li
University of Manchester, United Kingdom
Jie Zhang
CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore
Warren Del-Pinto
University of Manchester, United Kingdom
Goran Nenadic
Department of Computer Science, University of Manchester
Natural language processing · text mining · health informatics
Siew Kei Lam
Nanyang Technological University, Singapore
Anil Anthony Bharath
Imperial College London, United Kingdom