🤖 AI Summary
This study addresses key limitations of traditional synchronous STEM assessments (restricted accessibility, security risks from resource-sharing platforms, and limited cross-institutional comparability) by proposing an asynchronous, multi-attempt evaluation framework driven by generative AI. Leveraging prompt chaining and tool-augmented generation, the system systematically varies surface features and contextual elements while preserving core physical concepts and difficulty, building a large-scale isomorphic item bank. Open-source language models spanning 0.6B to 32B parameters are then used to automatically pre-validate items and screen for difficulty consistency before deployment. Experimental results show that 73% of deployed problem banks achieve statistically homogeneous difficulty, and that model performance correlates strongly with actual student performance (Pearson correlation up to 0.594). Notably, medium-scale models (4B to 14B parameters) prove most effective at detecting ambiguous wording and difficulty anomalies.
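The summary only names the generation technique, so the following is a minimal, self-contained sketch of the tool-use idea behind structural variation: a deterministic solver recomputes the answer key for every numerically and spatially varied clone, so each variant keeps the same underlying concept and a consistent answer. The template, solver, and parameter ranges here are illustrative assumptions, not the paper's actual prompts, items, or pipeline.

```python
# Sketch of the tool-use step in an isomorphic-variant pipeline (Python).
# The LM prompt-chaining calls are omitted; `solve_projectile` plays the
# role of the verification tool that recomputes the answer key after each
# numeric/spatial variation, so every variant stays internally consistent.
import math
import random

# Hypothetical seed template; in the real system this would come from an
# LM-extracted problem schema, not a hard-coded string.
SEED_TEMPLATE = (
    "A ball is launched {direction} from a {height:.0f} m ledge at "
    "{speed:.1f} m/s. How long until it lands? (g = 9.8 m/s^2)"
)

def solve_projectile(height: float, speed: float, direction: str) -> float:
    """Deterministic 'tool': solve height = v0*t + 0.5*g*t^2 for landing time."""
    g = 9.8
    v0 = speed if direction == "downward" else -speed  # downward positive
    disc = v0 * v0 + 2.0 * g * height                  # discriminant > 0
    return (-v0 + math.sqrt(disc)) / g                 # positive root

def make_variant(rng: random.Random) -> dict:
    # Structural variation: numeric values and a spatial-relation flip,
    # while the underlying concept (1-D kinematics) is held fixed.
    params = {
        "height": rng.choice([15.0, 20.0, 25.0, 30.0]),
        "speed": round(rng.uniform(2.0, 8.0), 1),
        "direction": rng.choice(["upward", "downward"]),
    }
    return {
        "text": SEED_TEMPLATE.format(**params),
        "answer_s": round(solve_projectile(**params), 2),
    }

if __name__ == "__main__":
    rng = random.Random(42)
    for item in (make_variant(rng) for _ in range(3)):
        print(f"{item['text']}  ->  t = {item['answer_s']} s")
```

In the actual framework, chained LM calls would generate and rewrite the templates and contextual framing, with the deterministic tool invoked to verify each variant's answer key.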
📝 Abstract
Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs, 0.6B-32B parameters) and compare their results against actual student performance (N > 200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM performance patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
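As a rough illustration of the pre-deployment checks described above, the sketch below runs a chi-squared homogeneity test over per-variant correct/incorrect counts and correlates LM pass rates with student pass rates. All numbers are fabricated placeholders, and the specific statistical choices (chi-squared test, Pearson correlation via SciPy, a median-deviation flag) are assumptions about how such validation could be implemented, not the paper's exact procedure.

```python
# Hedged sketch of item-bank validation: (1) is difficulty statistically
# homogeneous across variants? (2) do LM pass rates track student pass
# rates? Counts and LM accuracies below are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# rows = variants of one isomorphic bank; columns = [correct, incorrect]
student_counts = np.array([
    [41, 19],   # variant A
    [38, 22],   # variant B
    [40, 20],   # variant C
    [22, 38],   # variant D: suspiciously hard (possibly ambiguous text)
])

chi2, p, dof, _ = chi2_contingency(student_counts)
verdict = "homogeneous" if p >= 0.05 else "difficulty outlier present"
print(f"homogeneity test: chi2={chi2:.2f}, p={p:.4f} ({verdict})")

# Per-variant pass rates from students vs. an ensemble of mid-sized LMs
student_rate = student_counts[:, 0] / student_counts.sum(axis=1)
lm_rate = np.array([0.70, 0.62, 0.68, 0.35])  # placeholder LM accuracies

r, p_r = pearsonr(lm_rate, student_rate)
print(f"LM-vs-student correlation: r={r:.3f} (p={p_r:.3f})")

# Flag variants whose LM pass rate deviates far from the bank median;
# the abstract suggests mid-sized (4B-14B) models are most informative
# here, since very small/large models hit floor/ceiling effects.
median = np.median(lm_rate)
flags = [i for i, x in enumerate(lm_rate) if abs(x - median) > 0.2]
print("flagged variant indices:", flags)
```

The floor/ceiling finding motivates the median-deviation flag: a model that fails (or solves) every variant carries no signal about relative difficulty, so only models whose pass rates vary across variants can expose outliers.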