🤖 AI Summary
To address the scarcity of real student data in learning analytics due to privacy constraints and regulatory restrictions, this paper proposes a novel synthetic data generation framework integrating CTGAN with lightweight LLMs (GPT-2, DistilGPT-2, DialoGPT), representing the first systematic investigation of such hybrid generative approaches in education. Methodologically, the framework jointly models structured tabular features and pedagogically meaningful semantic contexts to produce high-fidelity, privacy-compliant synthetic student records. A multi-dimensional utility evaluation framework is introduced, assessing statistical similarity, distributional alignment, and downstream predictive transfer performance. Experimental results demonstrate that the generated data achieves fidelity and modeling utility comparable to real data; notably, the incorporation of LLMs substantially enhances sparse feature representation and semantic coherence. This work validates the effectiveness and innovative potential of hybrid generative paradigms for learning analytics.
📝 Abstract
In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and use. Synthetic data offers a promising alternative. We investigate whether synthetic data can be used to create artificial students that serve learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and to compare the generator models, especially the LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new approaches to generating learning analytics data.
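The abstract does not list the specific utility metrics used, so the following is only a minimal sketch of what a statistical check and a predictive check on synthetic data could look like. It assumes a two-sample Kolmogorov-Smirnov statistic for distributional similarity and a train-on-synthetic, test-on-real (TSTR) comparison for predictive utility. The toy `make_students` generator, the one-feature threshold classifier, and all names and parameters are hypothetical stand-ins, not the paper's data, models, or evaluation code.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def fit_threshold(rows):
    """Toy one-feature classifier: midpoint between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, rows):
    """Share of rows where 'score above threshold' matches the label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)


def make_students(n, rng, shift=0.0):
    """Hypothetical student records as (score, passed) pairs."""
    rows = []
    for _ in range(n):
        passed = rng.random() < 0.5
        mean = 70 if passed else 50
        rows.append((rng.gauss(mean + shift, 8), int(passed)))
    return rows


rng = random.Random(0)
real_train = make_students(500, rng)
real_test = make_students(500, rng)
# Stand-in for the output of a generator such as CTGAN or an LLM (assumption).
synthetic = make_students(500, rng, shift=1.0)

# Statistical utility: how far apart are the real and synthetic score distributions?
ks = ks_statistic([x for x, _ in real_train], [x for x, _ in synthetic])

# Predictive utility: train on real vs. train on synthetic, both tested on real data.
trtr = accuracy(fit_threshold(real_train), real_test)
tstr = accuracy(fit_threshold(synthetic), real_test)

print(f"KS statistic: {ks:.3f}")
print(f"Train-real/test-real accuracy:      {trtr:.3f}")
print(f"Train-synthetic/test-real accuracy: {tstr:.3f}")
```

In this kind of comparison, a small KS statistic and a TSTR accuracy close to the train-on-real baseline would suggest that the synthetic records preserve both the distribution and the predictive signal of the real ones.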