Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Synthetic medical imaging data sharing faces dual challenges: legal compliance risks (e.g., GDPR violations) and subpar technical utility, since synthetic data often underperforms real data in downstream tasks. To address this, we propose the first privacy-enhancing framework that jointly optimizes maximal generative diversity and predicate singling-out (PSO) security, built upon diffusion models with a diversity-aware training strategy enabling individual-level de-identification. Our method achieves strong privacy guarantees, resisting re-identification, while significantly improving fidelity and utility: models trained on synthetic data attain performance within one percentage point of those trained on real data, surpassing non-private baselines. Our core contribution is the first explicit formulation of generative diversity as a privacy-assurance dimension, co-optimized with theoretically grounded PSO constraints to jointly satisfy regulatory compliance, adversarial robustness, and practical usability.
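The summary frames generative diversity as a privacy dimension. A minimal, hypothetical way to quantify the diversity of a synthetic set (not the paper's actual metric; names and toy features here are illustrative) is the mean pairwise distance between synthetic samples in some feature space:

```python
# Illustrative sketch, NOT the paper's code: quantify generative diversity
# as the mean pairwise Euclidean distance between feature vectors.
# Higher values indicate better mode coverage; a mode-collapsed generator
# produces tightly clustered features and a low score.
import math

def mean_pairwise_distance(features):
    """Average Euclidean distance over all unordered pairs of features."""
    n = len(features)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(features[i], features[j])
            pairs += 1
    return total / pairs if pairs else 0.0

# Toy "features": a collapsed (low-diversity) set vs. a spread-out set.
collapsed = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
spread = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]

print(mean_pairwise_distance(collapsed) < mean_pairwise_distance(spread))  # True
```

In practice such a score would be computed on embeddings from a pretrained feature extractor rather than on raw coordinates.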

📝 Abstract
Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations. First and foremost, the legal aspect of sharing synthetic data is often neglected, and data regulations such as the GDPR are largely ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximizing image diversity rather than fidelity alone, so as to improve mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalizable framework for training diffusion models on personal data that yields non-personal synthetic datasets, achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at https://github.com/MischaD/Trichotomy.
Problem

Research questions and friction points this paper is trying to address.

Addressing legal neglect in synthetic data sharing under GDPR
Improving synthetic data performance for downstream applications
Ensuring PSO security via diversity-aware diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversity-aware diffusion models for synthetic data
PSO-secure synthetic datasets via diversity maximization
Generalizable framework for non-personal synthetic data
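The PSO notion referenced above (formalized by Cohen and Nissim) says a predicate "singles out" a dataset when it isolates exactly one individual; maximizing diversity makes such isolating predicates harder to construct. A toy sketch of the check, with hypothetical record attributes not taken from the paper:

```python
# Hypothetical illustration of predicate singling-out (PSO): a predicate
# singles out a dataset if it matches exactly one record. Record fields
# below are invented stand-ins for attributes of synthetic samples.

def singles_out(records, predicate):
    """Return True if `predicate` isolates exactly one record."""
    return sum(1 for r in records if predicate(r)) == 1

records = [
    {"age": 34, "finding": "nodule"},
    {"age": 34, "finding": "effusion"},
    {"age": 57, "finding": "nodule"},
]

# A narrow predicate isolating one record -> singles out.
print(singles_out(records, lambda r: r["age"] == 57))  # True
# A broader predicate matching two records -> no singling out.
print(singles_out(records, lambda r: r["finding"] == "nodule"))  # False
```

Intuitively, a diverse synthetic dataset avoids rare attribute combinations that trace back to a single real person, so predicates of this isolating kind fail.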