🤖 AI Summary
This study addresses the psychometric validity of generative AI for item authoring in educational assessment. We propose an LLM-based framework for self-critical, iterative item generation: large language models automatically generate test items, which are then refined through multiple rounds of AI-driven critique and revision. A large-scale empirical validation was conducted across 91 university classrooms in the U.S. (N ≈ 1700 students), embedded within authentic instructional settings. Items were evaluated using Item Response Theory (IRT) to estimate key psychometric properties, including difficulty and discrimination. Results indicate that AI-generated items performed comparably to expert-authored items on these core metrics. This provides large-scale, real-world evidence for the psychometric comparability and practical utility of LLM-authored assessments, along with a reproducible methodology and empirical foundation for AI-augmented educational measurement.
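For context on the metrics named above, difficulty and discrimination are typically estimated with a logistic IRT model. A common choice is the two-parameter logistic (2PL) model sketched below; the paper's exact model specification is not stated in this summary, so the 2PL form is an assumption.

```latex
% Two-parameter logistic (2PL) IRT model (assumed form, not confirmed by the summary):
% probability that student i answers item j correctly, given student ability \theta_i,
% item discrimination a_j, and item difficulty b_j.
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\!\big(-a_j(\theta_i - b_j)\big)}
```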
📝 Abstract
While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement about automatically generating questions using artificial intelligence, but comparatively little work evaluating the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI's role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes -- covering computer science, mathematics, chemistry, and more -- in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students.
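As a rough illustration of the generate-critique-revise cycle described in the abstract, the sketch below shows one way such a loop could be organized in Python. The function name, prompts, and the `llm` callable are hypothetical and are not taken from the paper's implementation.

```python
from typing import Callable

def refine_item(llm: Callable[[str], str], topic: str, rounds: int = 3) -> str:
    """Generate a draft exam question, then iteratively critique and revise it.

    `llm` is any text-in/text-out completion function (hypothetical here);
    the prompts are illustrative, not the paper's actual prompts.
    """
    # Initial generation step.
    draft = llm(f"Write one multiple-choice exam question on: {topic}")
    for _ in range(rounds):
        # Self-critique step: ask the model to evaluate its own draft.
        critique = llm(
            "Critique this exam question for clarity, difficulty, and "
            f"plausibility of distractors:\n{draft}"
        )
        # Revision step: rewrite the draft to address the critique.
        draft = llm(
            "Revise the question to address this critique.\n"
            f"Question:\n{draft}\nCritique:\n{critique}"
        )
    return draft
```

In practice, `llm` would wrap a call to a specific model API, and the critique prompt would encode whatever quality rubric the authors actually used.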