Cyborg Data: Merging Human with AI Generated Training Data

📅 2025-03-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Automated scoring (AS) systems depend on large-scale, labor-intensive human annotation, which incurs prohibitive cost. To address this, the paper proposes a teacher-student distillation framework termed "Cyborg Data": a large language model serves as the teacher, generating pseudo-labels from a small set of human-scored examples, and these pseudo-labels are then fused with the sparse ground-truth annotations to construct a hybrid human-AI training set. A lightweight student model trained on this hybrid dataset achieves performance comparable to models trained on full human annotations (≈100% relative accuracy on the benchmarks tested), while attaining higher inter-rater agreement than conventional human-human scoring. The Cyborg Data paradigm thereby combines high accuracy with practical deployability in AS under few-shot settings (using only 10% of the human-labeled data), establishing a foundation for scalable, cost-effective automated assessment.


๐Ÿ“ Abstract
Automated scoring (AS) systems used in large-scale assessment have traditionally used small statistical models that require a large quantity of hand-scored data to make accurate predictions, which can be time-consuming and costly. Generative Large Language Models are trained on many tasks and have shown impressive abilities to generalize to new tasks with little to no data. While these models require substantially more computational power to make predictions, they still require some fine-tuning to meet operational standards. Evidence suggests that these models can exceed human-human levels of agreement even when fine-tuned on small amounts of data. With this in mind, we propose a model distillation pipeline in which a large generative model, a Teacher, teaches a much smaller model, a Student. The Teacher, trained on a small subset of the training data, is used to provide scores on the remaining training data, which is then used to train the Student. We call the resulting dataset "Cyborg Data", as it combines human and machine-scored responses. Our findings show that Student models trained on "Cyborg Data" show performance comparable to training on the entire dataset, while only requiring 10% of the original hand-scored data.
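The pipeline in the abstract can be sketched in a few lines. This is a hedged illustration only: the function name `build_cyborg_data` and the stand-in teacher are hypothetical, and in the paper the Teacher is a fine-tuned generative LLM and the Student a much smaller model, not the stubs shown here.

```python
import random

def build_cyborg_data(responses, human_scores, human_fraction=0.10, seed=0):
    """Fuse a small human-scored subset with teacher pseudo-labels.

    responses:    list of response texts
    human_scores: dict mapping each response to its human score; only the
                  sampled ~10% subset is actually consumed, mirroring the
                  paper's reduced hand-scoring budget
    """
    rng = random.Random(seed)
    indices = list(range(len(responses)))
    rng.shuffle(indices)

    n_human = max(1, int(len(responses) * human_fraction))
    human_idx = set(indices[:n_human])

    # 1. "Fine-tune" the Teacher on the human-scored subset.
    #    Stub: memorize the subset (a real Teacher is a fine-tuned LLM).
    teacher_memory = {responses[i]: human_scores[responses[i]] for i in human_idx}

    def teacher_score(resp):
        # Stand-in for Teacher inference on an unseen response.
        return teacher_memory.get(resp, 0)

    # 2. The Teacher pseudo-labels the remaining ~90% of responses;
    #    human and machine scores are merged into one "Cyborg" dataset.
    cyborg = []
    for i in range(len(responses)):
        if i in human_idx:
            cyborg.append((responses[i], human_scores[responses[i]], "human"))
        else:
            cyborg.append((responses[i], teacher_score(responses[i]), "teacher"))
    return cyborg

# 3. The Student would then be trained on `cyborg` exactly as it would be
#    on a fully hand-scored dataset.
data = [f"response_{i}" for i in range(100)]
scores = {r: i % 4 for i, r in enumerate(data)}
cyborg = build_cyborg_data(data, scores)
print(sum(1 for _, _, src in cyborg if src == "human"))  # 10 hand-scored items
```

The key design point is that only the sampled subset's human scores are ever read, so the remaining 90% of hand-scoring effort is replaced by Teacher inference.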
Problem

Research questions and friction points this paper is trying to address.

Reduce reliance on costly human-scored training data
Improve small model performance using large model distillation
Achieve human-level accuracy with minimal fine-tuning data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model distillation pipeline for efficient training
Teacher-student architecture with generative models
Cyborg Data merges human and AI scores