A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the privacy risk posed by tabular diffusion models memorizing and reproducing training samples. Taking a data-centric perspective, it is the first to systematically characterize memorization dynamics, revealing that memorization follows a heavy-tailed distribution and that easily memorized samples exhibit strong signals early in training. To mitigate this, the authors propose DynamicCut, a model-agnostic, transferable two-stage mitigation framework that integrates per-sample memorization tracking, epoch-wise AUC-based memorization strength analysis, and dynamic pruning. Experiments demonstrate that DynamicCut significantly reduces memorization rates while preserving data utility, including statistical diversity and downstream task performance. It generalizes across multiple tabular datasets and generative models, including GANs and VAEs, and exhibits positive synergy with data-augmentation-based defenses.

📝 Abstract
Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by reproducing exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample based on how many generated samples are flagged as replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, confirmed via sample-removal experiments. To understand this, we divide real samples into top- and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC). Top-memorized samples are flagged as memorized slightly earlier and show stronger signals in early training. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method: (a) rank samples by epoch-wise intensity, (b) prune a tunable top fraction, and (c) retrain on the filtered dataset. Across multiple tabular datasets and models, DynamicCut reduces memorization with minimal impact on data diversity and downstream performance. It also complements augmentation-based defenses. Furthermore, DynamicCut enables cross-model transferability: high-ranked samples identified from one model (e.g., a diffusion model) are also effective for reducing memorization when removed from others, such as GANs and VAEs.
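The replica test described in the abstract can be made concrete with a short sketch. This is a minimal illustration, assuming Euclidean distance on preprocessed numeric features; the function name `memorization_counts` and the `ratio_threshold` value are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def memorization_counts(real: np.ndarray, generated: np.ndarray,
                        ratio_threshold: float = 0.5) -> np.ndarray:
    """Count, for each real sample, how many generated samples are
    flagged as its replicas via a relative distance ratio test."""
    dists = cdist(generated, real)              # (n_gen, n_real) pairwise distances
    order = np.argsort(dists, axis=1)
    nearest, second = order[:, 0], order[:, 1]  # closest and runner-up real rows
    rows = np.arange(len(generated))
    d1, d2 = dists[rows, nearest], dists[rows, second]
    # A generated row is flagged as a replica when it is much closer to its
    # nearest real neighbor than to the second-nearest one.
    is_replica = d1 / np.maximum(d2, 1e-12) < ratio_threshold
    counts = np.zeros(len(real), dtype=int)
    np.add.at(counts, nearest[is_replica], 1)   # per-real-sample replica counts
    return counts
```

A heavy-tailed distribution of `counts` would mean a few real rows absorb most replica flags, which is the pattern the paper reports.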
Problem

Research questions and friction points this paper is trying to address.

Identify high-risk samples causing privacy leaks in tabular diffusion models
Analyze training dynamics of memorized samples in early stages
Propose DynamicCut method to mitigate memorization while preserving data utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantify memorization using relative distance ratio
Propose DynamicCut for two-stage mitigation (see the sketch after this list)
Enable cross-model transferability of high-ranked samples
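The pruning stage can be sketched as follows. This is a hedged illustration of steps (a)-(c) from the abstract; `intensity_history`, `prune_fraction`, and the AUC-as-sum scoring are assumptions, since the paper's exact scoring and retraining schedule are not reproduced here.

```python
import numpy as np

def dynamiccut_keep_indices(intensity_history: np.ndarray,
                            prune_fraction: float = 0.05) -> np.ndarray:
    """intensity_history: (n_epochs, n_real) array of per-epoch memorization
    intensity, e.g. the replica counts above tracked across training.
    Returns indices of real samples to keep for retraining."""
    # (a) score each sample by the area under its intensity-vs-epoch curve
    # (discrete AUC with unit epoch spacing reduces to a sum over epochs)
    auc = intensity_history.sum(axis=0)
    # (b) prune a tunable top fraction of the highest-AUC samples
    n_real = intensity_history.shape[1]
    n_prune = int(prune_fraction * n_real)
    top_memorized = np.argsort(auc)[::-1][:n_prune]
    return np.setdiff1d(np.arange(n_real), top_memorized)

# (c) retrain on the filtered dataset:
# keep = dynamiccut_keep_indices(history, prune_fraction=0.05)
# model.fit(real_data[keep])   # hypothetical training call
```

Because the ranking depends only on tracked intensities, the same kept/pruned split can be reused when retraining other generators such as GANs or VAEs, which is the cross-model transferability the paper reports.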
Zhengyu Fang
Case Western Reserve University
Machine Learning · Deep Learning · Gen AI · Time-Series · AI for Science
Zhimeng Jiang
Department of Computer Science & Engineering, Texas A&M University
Huiyuan Chen
Amazon
Machine Learning · Deep Learning · Recommender Systems
Xiaoge Zhang
The Hong Kong Polytechnic University
Artificial Intelligence · Risk and Reliability · Data Science · Uncertainty Quantification
Kaiyu Tang
Department of Computer and Data Sciences, Case Western Reserve University
Xiao Li
Department of Computer and Data Sciences, Case Western Reserve University, Department of Biochemistry, Case Western Reserve University, Center for RNA Science and Therapeutics, Case Western Reserve University, Department of Biomedical Engineering, Case Western Reserve University
Jing Li
Department of Computer and Data Sciences, Case Western Reserve University