Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high cost and inefficiency of human pilot testing for calibrating item difficulty in standardized mathematics assessments. As a scalable, low-cost alternative, the authors propose a role-playing simulation approach built on open-source large language models (LLMs). Through carefully designed prompting, LLMs are instructed to emulate students of varying grade levels, mathematical abilities, genders, and racial backgrounds, generating synthetic response data used to fit Item Response Theory (IRT) models and predict real-world item difficulty. Surprisingly, the mathematically weaker model (Gemma) yields more accurate difficulty predictions than its mathematically stronger counterparts (Llama and Qwen), and stratifying simulated student names across gender and race further improves prediction performance. The method achieves Pearson correlation coefficients of 0.75, 0.76, and 0.82 on NAEP mathematics items for grades 4, 8, and 12, respectively, demonstrating its potential to substantially reduce reliance on costly human pilot testing.
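The core of the approach is a role-play prompt that asks an open-source LLM to answer a multiple-choice item as a particular simulated student. Below is a minimal sketch of one such simulation call using the Hugging Face transformers pipeline; the persona wording, the named student, the sample item, and the specific model checkpoint are illustrative assumptions, not the paper's exact prompt (the paper experiments with Gemma, Llama, and Qwen).

```python
# Minimal sketch of a single role-play simulation call (illustrative, not the
# paper's exact prompt or pipeline). The Gemma checkpoint below is one example
# of an open-source model and may require Hugging Face authentication.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-2b-it")

persona = (
    "You are Maria, a 4th-grade student who finds math somewhat difficult. "
    "Answer the question the way such a student realistically would, "
    "then give your final choice as a single letter."
)
question = (
    "Which fraction is largest?\n"
    "A) 1/3  B) 1/2  C) 1/4  D) 1/6"
)

out = generator(persona + "\n\n" + question, max_new_tokens=64)
print(out[0]["generated_text"])
```

Repeating this kind of call across a simulated "classroom" of personas, and scoring each answer for correctness, yields the binary response matrix used to fit the IRT models described below.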

📝 Abstract
Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computational cost and accuracy. We find that role-plays with named students improve predictions (compared to student IDs), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
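As a rough illustration of the IRT step, the sketch below fits a one-parameter (Rasch) model to a simulated response matrix by joint maximum likelihood and correlates the learned item difficulties with real-world difficulty values. The abstract does not specify which IRT model or estimation routine the authors use, so the Rasch form, the gradient-ascent fit, and all data here are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): fit a Rasch (1PL) IRT model to a
# simulated student-by-item response matrix, then correlate the learned item
# difficulties with real-world difficulty statistics. Data are random placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_students, n_items = 100, 30          # "classroom size" x number of items
responses = rng.integers(0, 2, size=(n_students, n_items)).astype(float)  # 1 = correct

theta = np.zeros(n_students)           # simulated-student abilities
b = np.zeros(n_items)                  # item difficulties (quantity of interest)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(500):                   # joint MLE via simple gradient ascent
    p = sigmoid(theta[:, None] - b[None, :])   # P(correct) under the Rasch model
    resid = responses - p
    theta += lr * resid.sum(axis=1) / n_items
    b -= lr * resid.sum(axis=0) / n_students
    theta -= theta.mean()              # anchor the ability scale (identifiability)

# Compare learned difficulties to real-world item statistics (placeholder values
# standing in for the NAEP item-level data used in the paper).
real_difficulty = rng.normal(size=n_items)
r, _ = pearsonr(b, real_difficulty)
print(f"Pearson r between learned and real difficulties: {r:.2f}")
```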
Problem

Research questions and friction points this paper is trying to address.

item difficulty estimation
standardized math assessment
large language models
Item Response Theory
student simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM simulation
Item Response Theory
difficulty estimation
student role-play
open-source LLMs
🔎 Similar Papers
No similar papers found.