Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms

📅 2025-01-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Balancing privacy preservation, algorithmic fairness, and model utility remains a fundamental challenge in learning analytics. Method: This study examines whether privacy and fairness can be improved together by combining synthetic data generators (CTGAN, TVAE, and the fairness-aware DECAF) with pre-processing fairness algorithms such as Reweighting, and it evaluates those algorithms on both synthetic and real datasets. Contribution/Results: Pre-processing algorithms are reported to improve fairness by 23% on average on synthetic data, at a cost of 7% in accuracy. DECAF achieves the best balance between privacy and fairness, but it does so at the expense of utility, as reflected in its predictive accuracy. Pre-processing fairness algorithms improve fairness more on synthetic data than on real data, which makes combining synthetic data generation with fairness pre-processing a promising route to fairer learning analytics models.
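To make the Reweighting step concrete, here is a minimal sketch of the classic reweighing scheme, which gives each row the weight P(group) * P(label) / P(group, label) so that group and label look statistically independent in the weighted data. It is an illustration under stated assumptions and not the paper's implementation: the function names and the toy rows are hypothetical, and every (group, label) combination is assumed to occur at least once.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per row: w(g, y) = P(g) * P(y) / P(g, y)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels (label == 1) within one group."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == group]
    total = sum(w for _, w in rows)
    return sum(w for y, w in rows if y == 1) / total


# Hypothetical toy data: group "a" receives the positive label far more often.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(rate_a, rate_b)
```

On this toy input the unweighted positive rates are 0.75 and 0.25, while the weighted rates coincide. The weights would then be passed to a downstream classifier as sample weights.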

📝 Abstract

The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.
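Comparing fairness across real and synthetic data requires a group fairness metric. The abstract does not name the metrics used, so the sketch below assumes statistical parity difference, one common choice: the positive-prediction rate of the unprivileged group minus that of the privileged group, where 0 means parity. The function names, group labels, and prediction vectors are all hypothetical and are not results from the paper.

```python
def positive_rate(groups, predictions, group):
    """Share of positive predictions (1) among rows belonging to `group`."""
    selected = [p for g, p in zip(groups, predictions) if g == group]
    return sum(selected) / len(selected)


def statistical_parity_difference(groups, predictions, privileged, unprivileged):
    """Unprivileged positive rate minus privileged positive rate (0 = parity)."""
    return (
        positive_rate(groups, predictions, unprivileged)
        - positive_rate(groups, predictions, privileged)
    )


# Hypothetical toy predictions from two models on the same eight rows.
groups = ["p", "p", "p", "p", "u", "u", "u", "u"]
predictions_model_a = [1, 1, 1, 0, 1, 0, 0, 0]
predictions_model_b = [1, 1, 0, 0, 1, 1, 0, 0]

spd_a = statistical_parity_difference(groups, predictions_model_a, "p", "u")
spd_b = statistical_parity_difference(groups, predictions_model_b, "p", "u")
print(spd_a, spd_b)
```

A value closer to zero indicates a smaller gap between the groups. In a study like this one, the same metric would be computed for models trained on real data and on each synthetic dataset, before and after fairness pre-processing.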

Problem

Research questions and friction points this paper is trying to address.

Synthetic Data

Algorithmic Fairness

Privacy Preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

DECAF algorithm

synthetic data

fairness enhancement

🔎 Similar Papers

Machine Learning for Synthetic Data Generation: a Review