Contrastive Private Data Synthesis via Weighted Multiple Pre-trained Language Models (WASP)

📅 2025-02-01
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address low-quality synthetic data, excessive generation noise, and model bias in few-shot settings under differential privacy (DP), this paper proposes WASP, a framework that dynamically weights and fuses multiple pre-trained language models (PLMs), combining contrastive generation with DP constraints and requiring no fine-tuning of large models. WASP introduces a Top-Q voting mechanism for robust private distribution estimation, using only a small number of private samples together with low-quality synthetic samples to drive contrastive generation, which mitigates generation noise and model bias. Extensive experiments on six benchmark datasets with nine PLMs (six open-source and three closed-source) demonstrate significant improvements in downstream task performance under DP guarantees. The implementation is publicly available.
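The summary names a Top-Q voting mechanism but gives no algorithmic detail, so the following is a minimal sketch of what DP voting over synthetic candidates could look like: each private sample votes for its Q most similar synthetic samples, and Gaussian noise on the vote histogram supplies the privacy guarantee. The function name dp_top_q_votes, the cosine-similarity scoring, and the choice of the Gaussian mechanism are all assumptions, not the paper's specification.

```python
import numpy as np

def dp_top_q_votes(private_emb, synthetic_emb, q=5, sigma=1.0, seed=0):
    """Hypothetical DP Top-Q voting (illustrative only, not WASP's exact rule).

    private_emb:   (n_priv, d) embeddings of private samples
    synthetic_emb: (n_syn, d)  embeddings of synthetic candidates
    Returns one noisy vote count per synthetic candidate.
    """
    rng = np.random.default_rng(seed)

    # Cosine similarity between every private and synthetic sample.
    p = private_emb / np.linalg.norm(private_emb, axis=1, keepdims=True)
    s = synthetic_emb / np.linalg.norm(synthetic_emb, axis=1, keepdims=True)
    sim = p @ s.T                                  # shape (n_priv, n_syn)

    # Each private sample casts exactly q unit votes (its top-q neighbours),
    # so adding/removing one sample shifts the histogram by sqrt(q) in L2.
    votes = np.zeros(s.shape[0])
    top_q = np.argsort(-sim, axis=1)[:, :q]
    for row in top_q:
        votes[row] += 1.0

    # Gaussian mechanism: noise scaled by the L2 sensitivity sqrt(q).
    return votes + rng.normal(0.0, sigma * np.sqrt(q), size=votes.shape)
```

Candidates with high noisy counts would then serve as positive demonstrations and low-count ones as negatives for contrastive generation, matching the summary's description at a high level.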

📝 Abstract
Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods that rely on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise, and inherent pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLMs) framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
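The abstract mentions collaboration among dynamically weighted PLMs without specifying the update rule. As a hedged illustration only, the sketch below re-weights models multiplicatively by a per-round quality score (for instance, the average noisy Top-Q vote count of the samples each model produced); update_plm_weights, the exponential boost, and the temperature parameter are hypothetical, not taken from the paper.

```python
import numpy as np

def update_plm_weights(weights, round_scores, temperature=1.0):
    """Hypothetical multiplicative re-weighting of the PLM ensemble.

    weights:      current sampling weights over the PLMs (sums to 1)
    round_scores: one quality score per PLM for this round, e.g. the
                  mean noisy Top-Q vote count of the samples it produced
    Returns renormalised weights; better-scoring models generate a
    larger share of the next round's synthetic samples.
    """
    boost = np.exp(np.asarray(round_scores, dtype=float) / temperature)
    new = np.asarray(weights, dtype=float) * boost
    return new / new.sum()

# Example: 9 PLMs start uniform; the fourth model scored best this round.
w = np.full(9, 1 / 9)
scores = np.array([0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.4, 0.3, 0.2])
w = update_plm_weights(w, scores)
```

Because the re-weighting is multiplicative across rounds, persistently biased or noisy generators would be down-weighted over time, which is one plausible way to realise the "dynamically weighted" collaboration the abstract describes.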
Problem

Research questions and friction points this paper is trying to address.

Data Synthesis
Privacy Protection
Data Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

WASP
Differential Privacy
Pre-trained Language Models
👥 Authors
Tianyuan Zou
Institute for AI Industry Research, Tsinghua University
Yang Liu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yufei Xiong
Department of Mathematics, Harbin Institute of Technology, Weihai, Shandong, China
Jianqing Zhang
Shanghai Jiao Tong University, Shanghai, China
Jingjing Liu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Xiaozhou Ye
AsiaInfo Technologies, Shanghai, China
Ouyang Ye
AsiaInfo Technologies, Shanghai, China
Ya-Qin Zhang
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China