Scalable Generation and Validation of Isomorphic Physics Problems with GenAI

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses key limitations of traditional synchronous STEM assessments—namely restricted accessibility, security risks in resource sharing, and insufficient cross-institutional comparability—by proposing an asynchronous, generative AI–driven multi-attempt evaluation framework. Leveraging prompt chaining and tool-augmented generation, the system systematically varies surface features and contextual elements while preserving core physical concepts and difficulty levels, thereby constructing a large-scale isomorphic item bank. For the first time, multi-scale open-source language models (0.6B–32B parameters) are integrated to automatically pre-validate items and screen for difficulty consistency. Experimental results demonstrate that 73% of generated items achieve statistically homogeneous difficulty, with model-predicted scores showing strong correlation to actual student performance (Pearson r up to 0.594). Notably, medium-scale models (4B–14B parameters) exhibit optimal efficacy in detecting ambiguity and difficulty anomalies.

📝 Abstract
Traditional synchronous STEM assessments face growing challenges, including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs; 0.6B–32B) and compare against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM performance patterns correlate strongly with student performance (Pearson's ρ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
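The validation step described above compares how often LMs solve each isomorphic variant against how often students do, and reports a Pearson correlation. A minimal sketch of that check in pure Python — the function name and the per-variant solve rates below are illustrative assumptions, not data from the paper:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (made-up) per-variant solve rates for one problem bank:
lm_rates = [0.82, 0.45, 0.67, 0.30, 0.74]       # fraction of LM runs solving each variant
student_rates = [0.78, 0.52, 0.61, 0.35, 0.70]  # fraction of students solving each variant

r = pearson_r(lm_rates, student_rates)
# A high r suggests the LM panel ranks variant difficulty the way students do;
# variants that break the pattern would be flagged for manual review.
```

In a screening pipeline like the one described, a variant whose LM solve rate deviates sharply from the bank's overall trend would be a candidate difficulty outlier or ambiguous item.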
Problem

Research questions and friction points this paper is trying to address.

STEM assessment
accessibility barriers
security concerns
cross-institutional comparability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative AI
isomorphic physics problems
prompt chaining
difficulty validation
language model scaling
Naiming Liu
Rice University
AI for Education, Large Language Models, Natural Language Processing

Leo Murch
University of Central Florida

Spencer Moore
University of Central Florida

Tong Wan
University of Central Florida

Shashank Sonkar
Assistant Professor at University of Central Florida
Natural Language Processing, Machine Learning, Artificial Intelligence in Education

Richard G. Baraniuk
Rice University

Zhongzhou Chen
University of Central Florida