🤖 AI Summary
This work addresses the challenge of sharing educational data under stringent privacy regulations, where existing synthetic data generation methods often suffer from distorted marginal distributions and iterative drift. The authors propose a Non-Parametric Gaussian Copula (NPGC) framework that preserves each variable's marginal characteristics through empirical marginal distributions, models dependency structures via copula functions, and explicitly handles heterogeneous variable types and missing values. Differential privacy is integrated at both the marginal and correlation levels to balance privacy guarantees with statistical fidelity. Experiments on five benchmark datasets show that the method consistently produces high-quality synthetic data, achieving strong downstream task performance at significantly lower computational cost than deep learning baselines. The framework has also been deployed in a real-world online learning platform.
📝 Abstract
To advance Educational Data Mining (EDM) under strict privacy regulations, researchers must develop methods that enable data-driven analysis while protecting sensitive student information. Synthetic data generation is one such approach, enabling the release of statistically generated samples instead of real student records; however, existing deep learning and parametric generators often distort marginal distributions and degrade under iterative regeneration, leading to distribution drift and progressive loss of distributional support that compromise reliability. In response, we introduce the Non-Parametric Gaussian Copula (NPGC), a plug-and-play synthesis method that replaces deep learning and parametric optimization with empirical statistical anchoring, preserving the observed marginal distributions while modeling dependencies through a copula framework. NPGC integrates Differential Privacy (DP) at both the marginal and correlation levels, supports heterogeneous variable types, and treats missing data as an explicit state to retain informative absence patterns. We evaluate NPGC against deep learning and parametric baselines on five benchmark datasets and demonstrate that it remains stable across multiple regeneration cycles and achieves competitive downstream performance at substantially lower computational cost. We further validate NPGC through deployment in a real-world online learning platform, demonstrating its practicality for privacy-preserving research.
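The core idea of a nonparametric Gaussian copula, empirical marginals combined with a latent Gaussian correlation structure, can be sketched in a few lines. The snippet below is a minimal illustration under our own assumptions, not the authors' implementation: it uses toy numeric data, a rank-based transform to normal scores, and empirical quantile inversion, and it omits the differential-privacy noise, heterogeneous-type handling, and explicit missing-value state that NPGC adds on top.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy stand-in for student records: two correlated columns, one skewed.
data = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
data[:, 1] = np.exp(data[:, 1])  # make the second marginal non-Gaussian

n, d = data.shape

# 1. Map each column to (0, 1) via its empirical CDF (rank transform),
#    then to standard normal scores (the latent Gaussian space).
u = (np.argsort(np.argsort(data, axis=0), axis=0) + 0.5) / n
z = stats.norm.ppf(u)

# 2. Estimate the latent Gaussian correlation matrix from the scores.
corr = np.corrcoef(z, rowvar=False)

# 3. Draw fresh correlated normal scores and push them back through each
#    column's empirical quantile function, so the synthetic marginals
#    match the observed ones while dependencies follow `corr`.
z_new = rng.multivariate_normal(np.zeros(d), corr, size=n)
u_new = stats.norm.cdf(z_new)
synth = np.empty_like(data)
for j in range(d):
    synth[:, j] = np.quantile(data[:, j], u_new[:, j])
```

Because step 3 interpolates within the observed values of each column, the synthetic marginals cannot drift outside the empirical support, which is the property the abstract contrasts with parametric and deep generators under repeated regeneration.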