🤖 AI Summary
This study addresses the dual challenges of data scarcity and privacy preservation in educational technology by presenting what the authors describe as the first systematic benchmark of synthetic data generation methods in education. Using a student performance dataset of 10,000 records, the work compares three resampling techniques (SMOTE, Bootstrap, Random Oversampling) against three deep generative models (Autoencoder, Variational Autoencoder (VAE), Copula-GAN) on three criteria: distributional fidelity (Kolmogorov–Smirnov distance and Jensen–Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real score, TSTR), and privacy protection (Distance to Closest Record, DCR). The findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR ≈ 0.997) but offer essentially no privacy protection (DCR ≈ 0), whereas deep generative models provide strong privacy protection (DCR ≈ 1) at a significant cost in utility. Among the deep models, the VAE emerges as the best compromise, maintaining 83.3% predictive performance while preserving privacy. The authors therefore recommend a context-dependent choice: resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount.
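The privacy side of this trade-off can be illustrated with a minimal sketch of a Distance to Closest Record computation. This is not the study's implementation: the toy data, the `dcr` helper, the use of unnormalized Euclidean distance, and the "independent draws" standing in for a generative model are all assumptions made for illustration. How the study scales DCR to the reported 0 to 1 range is not stated here.

```python
import math
import random


def dcr(synthetic, real):
    """Mean Euclidean distance from each synthetic row to its nearest real row."""
    nearest = [min(math.dist(s, r) for r in real) for s in synthetic]
    return sum(nearest) / len(nearest)


rng = random.Random(0)

# Hypothetical toy data: 50 "real" records with two numeric features in [0, 1].
real = [(rng.random(), rng.random()) for _ in range(50)]

# Bootstrap: sample real rows with replacement, so every synthetic row
# is an exact copy of some real row.
bootstrap = [rng.choice(real) for _ in range(50)]

# Stand-in for a generative model: fresh draws that copy no real row.
independent = [(rng.random(), rng.random()) for _ in range(50)]

dcr_bootstrap = dcr(bootstrap, real)
dcr_independent = dcr(independent, real)

print(dcr_bootstrap)    # exactly 0.0: every row has an identical real match
print(dcr_independent)  # strictly positive
```

Because bootstrap rows are verbatim copies of real records, their distance to the closest real record is zero by construction, which is consistent with the DCR ≈ 0 reported for resampling methods.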
📝 Abstract
Synthetic data generation offers promise for addressing data scarcity and privacy concerns in educational technology, yet practitioners lack empirical guidance for choosing between traditional resampling techniques and modern deep learning approaches. This study presents the first systematic benchmark comparing these paradigms using a 10,000-record student performance dataset. We evaluate three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) along three dimensions: distributional fidelity (Kolmogorov–Smirnov distance, Jensen–Shannon divergence), machine learning utility (Train-on-Synthetic-Test-on-Real score, TSTR), and privacy preservation (Distance to Closest Record, DCR). Our findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but fail to protect privacy (DCR ≈ 0.00), while deep learning models provide strong privacy protection (DCR ≈ 1.00) at a significant cost in utility. Variational Autoencoders (VAEs) emerge as the best compromise, maintaining 83.3% predictive performance while preserving privacy. We also provide actionable recommendations: use traditional resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount. This work establishes a foundational benchmark and practical decision framework for synthetic data generation in learning analytics.
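The two fidelity metrics named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the paper's code: the function names are hypothetical, the Jensen–Shannon divergence is computed in base 2 over already-binned probabilities, and the binning scheme the authors used is not specified in the abstract.

```python
import bisect
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The supremum of |F_a - F_b| is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing to the KL sum.
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    # Mixture distribution; m_i > 0 wherever p_i or q_i is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Identical samples give 0, fully separated samples give 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]))
print(ks_statistic([1, 2, 3], [4, 5, 6]))

# Identical distributions give 0, disjoint distributions give 1 (in base 2).
print(js_divergence([0.5, 0.5], [0.5, 0.5]))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Both quantities are bounded between 0 and 1 under these conventions, with lower values indicating that a synthetic feature's distribution is closer to the real one.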