Contrastive Private Data Synthesis via Weighted Multiple Pre-trained Language Models (WASP)

📅 2025-02-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address low-quality synthetic data, excessive generation noise, and model bias in few-shot settings under differential privacy (DP), this paper proposes WASP, a framework that dynamically weights and fuses multiple pre-trained language models (PLMs), combining contrastive generation with DP constraints and requiring no fine-tuning of large models. WASP introduces a Top-Q voting mechanism for robust private distribution estimation, using only a small number of private samples together with low-quality synthetic samples to drive contrastive generation, which mitigates generation noise and model bias. Extensive experiments on six benchmark datasets with nine PLMs (six open-source and three closed-source) demonstrate significant improvements in downstream task performance under DP guarantees. The implementation is publicly available.
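The summary names a Top-Q voting mechanism but gives no algorithmic detail, so the following is a minimal sketch of what DP voting over synthetic candidates could look like: each private sample votes for its Q most similar synthetic samples, and Gaussian noise on the vote histogram supplies the privacy guarantee. The function name dp_top_q_votes, the cosine-similarity scoring, and the choice of the Gaussian mechanism are all assumptions, not the paper's specification.

```python
import numpy as np

def dp_top_q_votes(private_emb, synthetic_emb, q=5, sigma=1.0, seed=0):
    """Hypothetical DP Top-Q voting (illustrative only, not WASP's exact rule).

    private_emb:   (n_priv, d) embeddings of private samples
    synthetic_emb: (n_syn, d)  embeddings of synthetic candidates
    Returns one noisy vote count per synthetic candidate.
    """
    rng = np.random.default_rng(seed)

    # Cosine similarity between every private and synthetic sample.
    p = private_emb / np.linalg.norm(private_emb, axis=1, keepdims=True)
    s = synthetic_emb / np.linalg.norm(synthetic_emb, axis=1, keepdims=True)
    sim = p @ s.T                                  # shape (n_priv, n_syn)

    # Each private sample casts exactly q unit votes (its top-q neighbours),
    # so adding/removing one sample shifts the histogram by sqrt(q) in L2.
    votes = np.zeros(s.shape[0])
    top_q = np.argsort(-sim, axis=1)[:, :q]
    for row in top_q:
        votes[row] += 1.0

    # Gaussian mechanism: noise scaled by the L2 sensitivity sqrt(q).
    return votes + rng.normal(0.0, sigma * np.sqrt(q), size=votes.shape)
```

Candidates with high noisy counts would then serve as positive demonstrations and low-count ones as negatives for contrastive generation, matching the summary's description at a high level.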

📝 Abstract
Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods that rely on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise, and inherent pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLMs) framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
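The abstract mentions collaboration among dynamically weighted PLMs without specifying the update rule. As a hedged illustration only, the sketch below re-weights models multiplicatively by a per-round quality score (for instance, the average noisy Top-Q vote count of the samples each model produced); update_plm_weights, the exponential boost, and the temperature parameter are hypothetical, not taken from the paper.

```python
import numpy as np

def update_plm_weights(weights, round_scores, temperature=1.0):
    """Hypothetical multiplicative re-weighting of the PLM ensemble.

    weights:      current sampling weights over the PLMs (sums to 1)
    round_scores: one quality score per PLM for this round, e.g. the
                  mean noisy Top-Q vote count of the samples it produced
    Returns renormalised weights; better-scoring models generate a
    larger share of the next round's synthetic samples.
    """
    boost = np.exp(np.asarray(round_scores, dtype=float) / temperature)
    new = np.asarray(weights, dtype=float) * boost
    return new / new.sum()

# Example: 9 PLMs start uniform; the fourth model scored best this round.
w = np.full(9, 1 / 9)
scores = np.array([0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.4, 0.3, 0.2])
w = update_plm_weights(w, scores)
```

Because the re-weighting is multiplicative across rounds, persistently biased or noisy generators would be down-weighted over time, which is one plausible way to realise the "dynamically weighted" collaboration the abstract describes.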
Problem

Research questions and friction points this paper is trying to address.

Data Synthesis
Privacy Protection
Data Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

WASP
Differential Privacy
Pre-trained Language Models
👥 Authors
Tianyuan Zou
Institute for AI Industry Research, Tsinghua University
Yang Liu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yufei Xiong
Department of Mathematics, Harbin Institute of Technology, Weihai, Shandong, China
Jianqing Zhang
Shanghai Jiao Tong University, Shanghai, China
Jingjing Liu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Xiaozhou Ye
AsiaInfo Technologies, Shanghai, China
Ouyang Ye
AsiaInfo Technologies, Shanghai, China
Ya-Qin Zhang
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China