A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the privacy risk posed by tabular diffusion models memorizing and reproducing training samples. Taking a data-centric perspective, it is the first to systematically characterize memorization dynamics, revealing that memorization follows a heavy-tailed distribution and that easily memorized samples exhibit strong signals early in training. To mitigate this, the authors propose DynamicCut, a model-agnostic, transferable two-stage mitigation framework that integrates per-sample memorization tracking, epoch-wise AUC-based memorization strength analysis, and dynamic pruning. Experiments demonstrate that DynamicCut significantly reduces memorization rates while preserving data utility, including statistical diversity and downstream task performance. It generalizes across multiple tabular datasets and generative models, including GANs and VAEs, and exhibits positive synergy with data-augmentation-based defenses.

📝 Abstract
Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by reproducing exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample based on how many generated samples are flagged as replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, confirmed via sample-removal experiments. To understand this, we divide real samples into top- and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC). Top-memorized samples are flagged as memorized slightly earlier and show stronger signals in early training. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method: (a) rank samples by epoch-wise intensity, (b) prune a tunable top fraction, and (c) retrain on the filtered dataset. Across multiple tabular datasets and models, DynamicCut reduces memorization with minimal impact on data diversity and downstream performance. It also complements augmentation-based defenses. Furthermore, DynamicCut enables cross-model transferability: high-ranked samples identified from one model (e.g., a diffusion model) are also effective for reducing memorization when removed from others, such as GANs and VAEs.
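The replica test described in the abstract can be made concrete with a short sketch. This is a minimal illustration, assuming Euclidean distance on preprocessed numeric features; the function name `memorization_counts` and the `ratio_threshold` value are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def memorization_counts(real: np.ndarray, generated: np.ndarray,
                        ratio_threshold: float = 0.5) -> np.ndarray:
    """Count, for each real sample, how many generated samples are
    flagged as its replicas via a relative distance ratio test."""
    dists = cdist(generated, real)              # (n_gen, n_real) pairwise distances
    order = np.argsort(dists, axis=1)
    nearest, second = order[:, 0], order[:, 1]  # closest and runner-up real rows
    rows = np.arange(len(generated))
    d1, d2 = dists[rows, nearest], dists[rows, second]
    # A generated row is flagged as a replica when it is much closer to its
    # nearest real neighbor than to the second-nearest one.
    is_replica = d1 / np.maximum(d2, 1e-12) < ratio_threshold
    counts = np.zeros(len(real), dtype=int)
    np.add.at(counts, nearest[is_replica], 1)   # per-real-sample replica counts
    return counts
```

A heavy-tailed distribution of `counts` would mean a few real rows absorb most replica flags, which is the pattern the paper reports.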
Problem

Research questions and friction points this paper is trying to address.

Identify high-risk samples causing privacy leaks in tabular diffusion models
Analyze training dynamics of memorized samples in early stages
Propose DynamicCut method to mitigate memorization while preserving data utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantify memorization using relative distance ratio
Propose DynamicCut for two-stage mitigation (see the sketch after this list)
Enable cross-model transferability of high-ranked samples
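The pruning stage can be sketched as follows. This is a hedged illustration of steps (a)-(c) from the abstract; `intensity_history`, `prune_fraction`, and the AUC-as-sum scoring are assumptions, since the paper's exact scoring and retraining schedule are not reproduced here.

```python
import numpy as np

def dynamiccut_keep_indices(intensity_history: np.ndarray,
                            prune_fraction: float = 0.05) -> np.ndarray:
    """intensity_history: (n_epochs, n_real) array of per-epoch memorization
    intensity, e.g. the replica counts above tracked across training.
    Returns indices of real samples to keep for retraining."""
    # (a) score each sample by the area under its intensity-vs-epoch curve
    # (discrete AUC with unit epoch spacing reduces to a sum over epochs)
    auc = intensity_history.sum(axis=0)
    # (b) prune a tunable top fraction of the highest-AUC samples
    n_real = intensity_history.shape[1]
    n_prune = int(prune_fraction * n_real)
    top_memorized = np.argsort(auc)[::-1][:n_prune]
    return np.setdiff1d(np.arange(n_real), top_memorized)

# (c) retrain on the filtered dataset:
# keep = dynamiccut_keep_indices(history, prune_fraction=0.05)
# model.fit(real_data[keep])   # hypothetical training call
```

Because the ranking depends only on tracked intensities, the same kept/pruned split can be reused when retraining other generators such as GANs or VAEs, which is the cross-model transferability the paper reports.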
Zhengyu Fang
Case Western Reserve University
Machine Learning · Deep Learning · Gen AI · Time-Series · AI for Science
Zhimeng Jiang
Department of Computer Science & Engineering, Texas A&M University
Huiyuan Chen
Amazon
Machine Learning · Deep Learning · Recommender Systems
Xiaoge Zhang
The Hong Kong Polytechnic University
Artificial Intelligence · Risk and Reliability · Data Science · Uncertainty Quantification
Kaiyu Tang
Department of Computer and Data Sciences, Case Western Reserve University
Xiao Li
Department of Computer and Data Sciences, Case Western Reserve University, Department of Biochemistry, Case Western Reserve University, Center for RNA Science and Therapeutics, Case Western Reserve University, Department of Biomedical Engineering, Case Western Reserve University
Jing Li
Department of Computer and Data Sciences, Case Western Reserve University