🤖 AI Summary
To address the challenges of scarce real-world image data, high annotation costs, and privacy risks in image classification, this paper proposes Text-Conditioned Knowledge Recycling (TCKR), a framework that integrates dynamic image captioning, parameter-efficient LoRA fine-tuning of diffusion models, and Generative Knowledge Distillation to produce high-fidelity, semantically consistent synthetic training data. Evaluated on ten benchmark datasets, classifiers trained exclusively on TCKR-synthesised data achieve accuracy comparable to, and in several cases exceeding, that of models trained on real data. TCKR also substantially improves privacy robustness, reducing the area under the curve (AUC) of membership inference attacks by 5.49 points on average. The work presents the first systematic empirical demonstration that synthetic data generation can jointly deliver strong classification performance and improved privacy protection, offering a practical recipe for training vision models under low-data, high-privacy constraints.
📝 Abstract
Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.
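The three TCKR stages named in the abstract (dynamic image captioning, parameter-efficient diffusion fine-tuning, and Generative Knowledge Distillation) can be sketched as a single orchestration function. The sketch below is illustrative only: the function names, signatures, and data shapes are hypothetical stand-ins, not the paper's implementation, and each stage is passed in as a callable so the heavyweight components (captioner, diffusion model, teacher classifier) stay abstracted away.

```python
from typing import Callable, List, Tuple

def tckr_pipeline(
    real_images: List[Tuple[str, str]],                 # (image_id, class_label) pairs
    caption_fn: Callable[[str], str],                   # stage 1: dynamic captioning
    finetune_fn: Callable[[List[Tuple[str, str]]],
                          Callable[[str], str]],        # stage 2: LoRA fine-tune,
                                                        # returns a text-to-image generator
    soft_label_fn: Callable[[str], List[float]],        # stage 3: teacher soft labels
) -> List[Tuple[str, str, List[float]]]:
    """Hypothetical TCKR orchestration: real images in, synthetic
    (image, hard label, soft label) training triples out."""
    # Stage 1: caption each real image to build informative text conditions.
    captioned = [(caption_fn(img_id), label) for img_id, label in real_images]

    # Stage 2: parameter-efficient fine-tuning of the diffusion model on
    # (caption, image) pairs; abstracted here to a function that returns
    # a generator mapping a caption to a synthetic image.
    generate = finetune_fn(captioned)

    # Stage 3: generate synthetic images and attach the teacher's soft
    # labels (Generative Knowledge Distillation) for classifier training.
    synthetic = []
    for caption, label in captioned:
        syn_img = generate(caption)
        synthetic.append((syn_img, label, soft_label_fn(syn_img)))
    return synthetic
```

In practice the generator would be a text-conditioned diffusion model with LoRA adapters and the soft labels would come from a teacher classifier; the skeleton only fixes the data flow between the three stages.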
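The privacy metric reported above, membership inference AUC, can be made concrete with a minimal score-based attack evaluation. This is a generic sketch, not the paper's attack: the attacker scores each example by the model's confidence on it, and the AUC is the probability that a randomly chosen training member outranks a randomly chosen non-member, so 0.5 means the attacker cannot tell members apart.

```python
def mia_auc(member_scores, nonmember_scores):
    """AUC of a score-threshold membership inference attack.

    Equals the probability that a member receives a strictly higher
    attack score (e.g. model confidence, or negative loss) than a
    non-member, counting ties as half; 0.5 = no membership leakage.
    """
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total

# Toy numbers (made up, not from the paper): a classifier trained on
# real data tends to be overconfident on its own training set, so the
# member and non-member score distributions separate.
member_conf = [0.99, 0.97, 0.95, 0.96]
nonmember_conf = [0.70, 0.85, 0.60, 0.75]
print(mia_auc(member_conf, nonmember_conf))  # → 1.0, fully separable
```

A classifier trained only on synthetic data never sees the real training images, so their confidence distribution overlaps that of held-out images and the AUC drops toward 0.5, which is the mechanism behind the reported 5.49-point average reduction.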