Privacy Preserving Diffusion Models for Mixed-Type Tabular Data Generation

📅 2025-11-29
🤖 AI Summary
To support secure sharing of mixed-type tabular data in sensitive domains (e.g., finance, healthcare), this paper proposes DP-FinDiff, a diffusion-based generative framework trained under differential privacy (DP). The method has three components: (1) an adaptive timestep sampling strategy that aligns DP training updates with the diffusion dynamics; (2) a feature-aggregated loss function that mitigates the bias induced by gradient clipping; and (3) embedding-based representations for categorical features, coupled with privacy-aware optimization, which reduce encoding overhead and scale to high-dimensional mixed data. Evaluated on multiple real-world tabular datasets under strict privacy budgets (ε ≤ 2), the approach improves downstream task utility by 16–42% over state-of-the-art DP generative methods at comparable privacy levels.

📝 Abstract
We introduce DP-FinDiff, a differentially private diffusion framework for synthesizing mixed-type tabular data. DP-FinDiff employs embedding-based representations for categorical features, reducing encoding overhead and scaling to high-dimensional datasets. To adapt DP training to the diffusion process, we propose two privacy-aware training strategies: an adaptive timestep sampler that aligns updates with diffusion dynamics, and a feature-aggregated loss that mitigates clipping-induced bias. Together, these enhancements improve fidelity and downstream utility without weakening privacy guarantees. On financial and medical datasets, DP-FinDiff achieves 16–42% higher utility than DP baselines at comparable privacy levels, demonstrating its promise for safe and effective data sharing in sensitive domains.

Problem

Research questions and friction points this paper is trying to address.

How to generate mixed-type tabular data under differential privacy
How to adapt diffusion model training to differential privacy
How to improve data utility while maintaining privacy guarantees

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embedding-based categorical feature representations reduce encoding overhead and scale to high-dimensional datasets
Adaptive timestep sampler aligns updates with diffusion dynamics
Feature-aggregated loss mitigates clipping-induced bias in DP training
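
For readers unfamiliar with where clipping-induced bias comes from, the sketch below shows a generic DP-SGD-style gradient step: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping bound is added, and the result is averaged. This is not DP-FinDiff's feature-aggregated loss or its timestep sampler, whose details are not given here. The function names (`clip`, `dp_sgd_gradient`), the toy gradients, and the parameter values are all assumptions made for illustration.

```python
import math
import random


def clip(grad, clip_norm):
    """Rescale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Generic DP-SGD-style step: clip per example, sum, add noise, average."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(per_example_grads[0])
    # Noise standard deviation is tied to the clipping bound (the sensitivity).
    sigma = noise_multiplier * clip_norm
    summed = [
        sum(g[j] for g in clipped) + rng.gauss(0.0, sigma) for j in range(dim)
    ]
    n = len(per_example_grads)
    return [s / n for s in summed]


# Hypothetical per-example gradients: one large, two small.
grads = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.1]]
rng = random.Random(0)

# With the noise switched off, any gap to the true mean is due to clipping alone.
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
true_mean = [sum(g[j] for g in grads) / len(grads) for j in range(2)]

# A private step would use noise_multiplier > 0.
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

In this toy case only the first gradient exceeds the bound, so it alone is shrunk and the clipped average underestimates the true mean gradient even before any noise is added. That systematic gap is the kind of bias a clipping-aware loss design aims to reduce.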