🤖 AI Summary
To address secure sharing of mixed-type tabular data in sensitive domains (e.g., finance, healthcare), this paper proposes a diffusion-based generative framework under differential privacy (DP). The method introduces three key innovations: (1) an adaptive time-step sampling strategy that aligns privacy-constrained training updates with the diffusion dynamics; (2) a feature-aggregated loss function that mitigates the bias induced by per-example gradient clipping; and (3) embedding-based representations for categorical features coupled with privacy-aware optimization, enabling effective modeling of high-dimensional mixed data. Evaluated on real-world financial and medical tabular datasets under strict privacy budgets (ε ≤ 2), the approach improves downstream task utility by 16–42% over state-of-the-art DP generative baselines while preserving strong privacy guarantees and high data fidelity. This work offers a practical, efficient paradigm for synthesizing sensitive tabular data without compromising utility or privacy.
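The summary names an adaptive time-step sampling strategy but the paper's exact formulation is not given here. One common way to realize such a sampler, sketched below as a hypothetical illustration (the class name `AdaptiveTimestepSampler` and the EMA-reweighting scheme are assumptions, not the paper's method), is to track a per-timestep loss estimate and draw timesteps in proportion to it, so training effort concentrates on noise levels the model currently fits poorly:

```python
import numpy as np

rng = np.random.default_rng(0)

class AdaptiveTimestepSampler:
    """Hypothetical non-uniform timestep sampler for diffusion training.

    Keeps an exponential moving average (EMA) of the training loss at
    each timestep and samples timesteps with probability proportional
    to that EMA, instead of uniformly.
    """

    def __init__(self, num_timesteps, decay=0.9):
        self.ema = np.ones(num_timesteps)  # start from a uniform weighting
        self.decay = decay

    def sample(self, batch_size):
        # Normalize EMA weights into a probability distribution over timesteps.
        p = self.ema / self.ema.sum()
        return rng.choice(len(self.ema), size=batch_size, p=p)

    def update(self, t, loss):
        # Blend the newly observed loss at timestep t into its EMA.
        self.ema[t] = self.decay * self.ema[t] + (1.0 - self.decay) * loss

# Usage: draw a batch of timesteps, then feed observed losses back in.
sampler = AdaptiveTimestepSampler(num_timesteps=100)
t_batch = sampler.sample(8)
sampler.update(int(t_batch[0]), loss=2.0)
```

Under this (assumed) design, timesteps with persistently high loss gradually receive more of the per-step privacy budget's gradient updates, which is one plausible reading of "aligning updates with diffusion dynamics."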
📝 Abstract
We introduce DP-FinDiff, a differentially private diffusion framework for synthesizing mixed-type tabular data. DP-FinDiff employs embedding-based representations for categorical features, reducing encoding overhead and scaling to high-dimensional datasets. To adapt DP training to the diffusion process, we propose two privacy-aware training strategies: an adaptive timestep sampler that aligns updates with diffusion dynamics, and a feature-aggregated loss that mitigates clipping-induced bias. Together, these enhancements improve fidelity and downstream utility without weakening privacy guarantees. On financial and medical datasets, DP-FinDiff achieves 16–42% higher utility than DP baselines at comparable privacy levels, demonstrating its promise for safe and effective data sharing in sensitive domains.
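For readers unfamiliar with where clipping-induced bias comes from: DP-SGD clips each example's gradient to a fixed norm before adding calibrated Gaussian noise, and clipping distorts the average gradient direction. A feature-aggregated loss addresses this by collapsing the per-feature errors into a single scalar per example, so each example contributes exactly one gradient to be clipped. The sketch below is a minimal numpy illustration of these two standard DP-SGD ingredients; the function names and the specific aggregation (a mean over feature dimensions) are assumptions, not DP-FinDiff's published formulation:

```python
import numpy as np

def feature_aggregated_loss(pred_noise, true_noise):
    # Aggregate squared errors over all feature dimensions (numeric and
    # embedded categorical alike) into one scalar loss per example.
    return ((pred_noise - true_noise) ** 2).mean(axis=1)

def clip_and_noise(per_example_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """DP-SGD-style gradient aggregation (illustrative).

    Clip each example's gradient to `clip_norm`, sum the clipped
    gradients, add Gaussian noise scaled to the clipping bound, and
    average over the batch.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds the clipping bound.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)
```

With `noise_mult=0.0` the function reduces to plain clipped averaging, which makes the clipping bias easy to inspect in isolation: an example with gradient norm 5 contributes only a fifth of its true gradient, while examples inside the bound pass through unchanged.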