🤖 AI Summary
This work systematically uncovers a critical privacy risk in existing dataset distillation methods: when compressing real data into synthetic data, these approaches may implicitly encode the training trajectories of models, thereby leaking sensitive information about the original dataset. To expose this vulnerability, the authors propose an Information Revelation Attack (IRA) that integrates model inversion and membership inference techniques to infer the distillation algorithm, model architecture, and membership status, and even to reconstruct sensitive samples, from the synthetic data alone. Experimental results demonstrate that IRA can accurately identify both the distillation method and the model architecture, and can recover original sensitive data with high fidelity. These findings fundamentally challenge the prevailing assumption that dataset distillation inherently preserves privacy, revealing instead a severe and previously underappreciated privacy leakage risk.
📝 Abstract
Dataset distillation compresses a large real dataset into a small synthetic one, enabling models trained on the synthetic data to achieve performance comparable to those trained on the real data. Although synthetic datasets are assumed to be privacy-preserving, we show that existing distillation methods can cause severe privacy leakage: because synthetic datasets implicitly encode the weight trajectories of the distilled model, they become over-informative and exploitable by adversaries. To expose this risk, we introduce the Information Revelation Attack (IRA) against state-of-the-art distillation techniques. Experiments show that IRA accurately predicts both the distillation algorithm and the model architecture, and can successfully infer membership and recover sensitive samples from the real dataset.
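The abstract does not detail how the membership-inference component of IRA operates. As a generic illustration of the idea (not the paper's actual attack), a classic loss-threshold membership inference test guesses that a sample was in the training set when a model's loss on it is unusually low; all names and loss values below are hypothetical.

```python
def loss_threshold_mia(loss, threshold=0.5):
    # Guess "member" when the per-sample loss falls below the threshold:
    # models typically fit training points more tightly than unseen data.
    return loss < threshold

# Hypothetical per-sample losses (not taken from the paper's experiments)
member_losses = [0.05, 0.12, 0.08]     # samples used in training: low loss
nonmember_losses = [0.90, 1.40, 0.70]  # held-out samples: higher loss

member_preds = [loss_threshold_mia(l) for l in member_losses]
nonmember_preds = [loss_threshold_mia(l) for l in nonmember_losses]
```

In this toy setting, all training-set samples are flagged as members and all held-out samples as non-members; a real attack would calibrate the threshold, e.g. against shadow models.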