Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis

📅 2025-04-04
🤖 AI Summary
Problem: Large language models (LLMs) score open-ended responses in STEM formative assessments significantly less accurately than supervised models, particularly under low-resource conditions. Method: We propose a concept-driven assessment framework integrating concept-aligned prompt engineering, structured rubric orchestration, LLM-generated synthetic data, and joint training with lightweight supervised models. Contribution/Results: This work presents the first systematic validation that concept-based scoring rubrics effectively narrow the performance gap between LLMs and supervised models in low-resource settings. On a multi-source STEM student-response dataset, LLM-based assessment accuracy improves by 32%. High-quality synthetic data generated by the LLM successfully trains a supervised model with fewer than 10 million parameters, achieving an F1 score of 0.89. Our approach simultaneously enhances LLM evaluation capability and enables efficient training of compact models, establishing a novel paradigm for low-resource educational AI assessment.
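
To make the rubric idea concrete, here is a minimal sketch of how a concept-based rubric might be rendered into a scoring prompt. The `Concept` structure, the prompt wording, and the example rubric are illustrative assumptions, not the paper's published template.

```python
# Illustrative sketch: rendering a concept-based rubric into a scoring prompt.
# The rubric fields and prompt wording are assumptions; the paper's exact
# template is not reproduced in this summary.
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    description: str   # what a correct response must express
    points: int        # credit awarded if the concept is present

def build_scoring_prompt(question: str, response: str, rubric: list[Concept]) -> str:
    rubric_lines = "\n".join(
        f"- {c.name} ({c.points} pt): {c.description}" for c in rubric
    )
    return (
        "You are grading a student's short answer to a STEM question.\n"
        f"Question: {question}\n"
        f"Student response: {response}\n"
        "For each concept below, state whether it is present in the response, "
        "award the points if so, then report the total score.\n"
        f"Concepts:\n{rubric_lines}"
    )

# Hypothetical example rubric for a physics question.
rubric = [
    Concept("energy conservation", "states that total mechanical energy is constant", 1),
    Concept("PE-to-KE exchange", "explains that potential energy converts to kinetic energy", 1),
]
prompt = build_scoring_prompt(
    "Why does a pendulum speed up at the bottom of its swing?",
    "Because gravity pulls it down and it goes fastest at the lowest point.",
    rubric,
)
# `prompt` would then be sent to an LLM through whatever client is in use.
```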

📝 Abstract
Formative assessment in STEM topics aims to promote student learning by identifying students' current understanding, thus targeting how to promote further learning. Previous studies suggest that the assessment performance of current generative large language models (LLMs) on constructed responses to open-ended questions is significantly lower than that of supervised classifiers trained on high-quality labeled data. However, we demonstrate that concept-based rubrics can significantly enhance LLM performance, narrowing the gap between LLMs as off-the-shelf assessment tools and smaller supervised models, which need large amounts of training data. For datasets where concept-based rubrics allow LLMs to achieve strong performance, we show that the same rubrics help the LLMs generate high-quality synthetic data for training lightweight, high-performance supervised models. Our experiments span diverse STEM student response datasets with labels of varying quality, including a new real-world dataset that contains some AI-assisted responses, which introduces additional considerations.
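
The abstract's synthetic-data step could look roughly like the following sketch, which enumerates concept subsets so every score level is represented in the generated set. The `generate_text` callable and the sampling scheme are hypothetical stand-ins for whatever LLM client and generation strategy the authors actually used.

```python
# Illustrative sketch: using a concept-based rubric to prompt an LLM for
# labeled synthetic student responses. `generate_text` is a hypothetical
# stand-in for any text-completion call; the combinatorial sampling scheme
# is an assumption, not the paper's procedure.
import itertools
import random

def synthesis_prompt(question: str, present: list[str], absent: list[str]) -> str:
    return (
        "Write a short, realistic student answer to the question below.\n"
        f"Question: {question}\n"
        f"The answer MUST correctly express these concepts: {present or ['none']}.\n"
        f"The answer MUST NOT express these concepts: {absent or ['none']}.\n"
        "Return only the answer text."
    )

def make_synthetic_set(question: str, concepts: list[str], generate_text, n_per_combo: int = 5):
    dataset = []
    # Enumerate every subset of concepts so each possible score is covered.
    for r in range(len(concepts) + 1):
        for combo in itertools.combinations(concepts, r):
            present = list(combo)
            absent = [c for c in concepts if c not in combo]
            for _ in range(n_per_combo):
                text = generate_text(synthesis_prompt(question, present, absent))
                # Label = number of rubric concepts the response was asked to express.
                dataset.append({"response": text, "label": len(present)})
    random.shuffle(dataset)
    return dataset
```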
Problem

Research questions and friction points this paper addresses.

Enhancing LLM scoring performance in formative STEM assessment
Narrowing the gap between off-the-shelf LLMs and supervised classifiers
Generating synthetic data for training supervised models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept-based rubrics enhance LLM assessment performance
Rubric-guided LLMs generate synthetic training data for supervised models (see the sketch after this list)
Validated across diverse STEM datasets, including one with AI-assisted responses
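
The summary reports an F1 of 0.89 from a supervised model with fewer than 10 million parameters, but does not name the architecture. As a compact stand-in, training a TF-IDF + logistic regression pipeline on the synthetic set might look like this; the choice of model is an assumption, not the paper's method.

```python
# Illustrative stand-in: training a lightweight classifier on LLM-generated
# synthetic data. The paper's actual sub-10M-parameter model is not specified
# in this summary; TF-IDF + logistic regression is just a compact baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_on_synthetic(dataset):
    # `dataset` is a list of {"response": str, "label": int} dicts, e.g. the
    # output of make_synthetic_set above.
    texts = [d["response"] for d in dataset]
    labels = [d["label"] for d in dataset]
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print("macro F1:", f1_score(y_test, model.predict(X_test), average="macro"))
    return model
```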