🤖 AI Summary
Problem: No large-scale, open-source educational dataset comprising comprehensive assignment descriptions, rubrics, and student submissions currently exists, which hinders research on generalisable automated feedback generation.
Method: We propose the Sophisticated Assignment Mimicry (SAM) framework, which leverages large language models to generate high-fidelity, privacy-preserving synthetic computer science assignments. Through one-to-one mimicking, SAM preserves the semantic consistency, length distribution, and rubric structure of the real-world data. Generation quality is validated against the real data using BERTScore and Pearson correlation.
Contribution/Results: We release a dataset of 10,000 synthetic submissions spanning 155 assignments across 59 university-level computer science courses. LLM-generated feedback for the synthetic assignments is as effective as feedback generated for the corresponding real assignments. This work establishes the first scalable, privacy-compliant, and high-fidelity paradigm for educational assignment synthesis, providing critical infrastructure for research on LLM-driven educational feedback.
📝 Abstract
Using LLMs to give students educational feedback on their assignments has attracted much attention in the AI in Education field. Yet, there is currently no large-scale open-source dataset of student assignments that includes detailed assignment descriptions, rubrics, and student submissions across various courses. As a result, research on generalisable methodology for the automatic generation of effective and responsible educational feedback remains limited. In the current study, we constructed a large-scale dataset of Synthetic Computer science Assignments for LLM-generated Educational Feedback research (SCALEFeedback). We propose a Sophisticated Assignment Mimicry (SAM) framework that generates the synthetic dataset through one-to-one LLM-based imitation of real assignment descriptions and student submissions, producing a synthetic version of each. Our open-source dataset contains 10,000 synthetic student submissions spanning 155 assignments across 59 university-level computer science courses. Compared to the corresponding real-world assignment dataset, our synthetic submissions achieved a BERTScore F1 of 0.84 and Pearson correlation coefficients (PCC) of 0.62 for assignment marks and 0.85 for submission length, while ensuring perfect protection of students' private information. On all of these metrics, our SAM framework outperformed a naive mimicry baseline. LLM-generated feedback for our synthetic assignments was as effective as feedback generated for the real-world assignment dataset. Our research shows that one-to-one LLM imitation is a promising method for generating open-source synthetic educational datasets that preserve the original dataset's semantic meaning and student data distribution, while protecting student privacy and institutional copyright. SCALEFeedback enhances our ability to develop generalisable LLM-based methods for offering high-quality, automated educational feedback in a scalable way.
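The distribution-fidelity checks reported above (PCC of 0.62 for marks, 0.85 for length) can be reproduced in principle by correlating per-submission statistics of the real and synthetic versions. The sketch below is illustrative only: the function name and sample data are not from the paper's released code, and the real evaluation would run over all 10,000 paired submissions.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    assert n == len(ys) and n > 1, "need two equal-length sequences"
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-submission word counts for paired real/synthetic versions.
real_lengths = [520, 310, 780, 450, 610, 290, 700]
synthetic_lengths = [500, 340, 760, 470, 590, 310, 680]

pcc = pearson_corr(real_lengths, synthetic_lengths)
print(f"length PCC = {pcc:.2f}")
```

The same routine would apply to assignment marks; semantic fidelity (BERTScore F1) additionally requires contextual embeddings, e.g. via the `bert-score` package, and is omitted here.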