Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can serve as reliable proxies for real students in educational assessment, evaluating how their performance on mathematics and reading comprehension tasks compares with that of actual student populations. Method: For the first time, 11 state-of-the-art LLMs are systematically placed within an Item Response Theory (IRT) framework anchored to the National Assessment of Educational Progress (NAEP) standardized scale, enabling cross-model, cross-grade, and cross-domain ability alignment. Contribution/Results: Without guidance, top-tier LLMs consistently outperform the average human student; grade-enforcement prompting modulates performance, but its effects are highly model-, domain-, and grade-dependent and do not generalize. The work establishes the first IRT-based paradigm for benchmarking LLMs' educational capabilities, identifies systematic biases in proxy alignment, and proposes task-driven, model-adapted guidelines for principled proxy selection in educational research and assessment.

📝 Abstract
Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models' performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
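To make the IRT placement concrete, the sketch below shows (in Python) one minimal way an LLM's scored responses to pre-calibrated items could be mapped to an ability estimate on an anchored scale. This is not the authors' code: the 3PL model choice, the item parameters, the response vector, and the function names are all illustrative assumptions.

```python
# Minimal sketch: place one LLM's 0/1 item responses on an anchored IRT scale.
# Assumes a 3PL model and item parameters (a, b, c) already calibrated on the
# target scale; the only free parameter estimated here is the ability theta.
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b, c):
    """3PL item response function: probability of answering an item correctly."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, c):
    """Maximum-likelihood ability estimate for a single 0/1 response vector."""
    def neg_log_lik(theta):
        p = np.clip(p_correct(theta, a, b, c), 1e-6, 1 - 1e-6)  # numerical safety
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical example: five calibrated items and one model's scored responses.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])    # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.2, 2.0])   # difficulty (anchored scale)
c = np.array([0.2, 0.25, 0.2, 0.2, 0.25])  # guessing (multiple choice)
responses = np.array([1, 1, 1, 0, 0])
print(f"Estimated ability theta = {estimate_theta(responses, a, b, c):.2f}")
```

Because the item parameters are held fixed at their calibrated values, the resulting theta is directly comparable to abilities of real student populations on the same scale, which is the comparison the paper relies on.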
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' accuracy in simulating real students' math and reading abilities
Compare LLMs' performance with real students using Item Response Theory
Evaluate grade-enforcement prompts' impact on LLM alignment with student abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs as proxy students for ITS development
Applying IRT to compare LLMs with real students
Grade-enforcement prompts adjust LLM performance toward target grade levels (see the prompt sketch below)
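As a rough illustration of the grade-enforcement idea referenced above, the snippet below wraps a test item in an instruction asking the model to respond like an average student at a given grade. The prompt wording, the grade/subject parameters, and the query_llm callable are hypothetical placeholders, not the prompts or client used in the paper.

```python
# Hypothetical grade-enforcement wrapper; `query_llm` stands in for whatever
# chat-completion client is being evaluated.
GRADE_PROMPT = (
    "You are a typical grade-{grade} student. Answer the following {subject} "
    "question exactly as an average grade-{grade} student would, including any "
    "mistakes such a student might plausibly make.\n\nQuestion:\n{item}"
)

def ask_as_student(query_llm, item_text, grade=8, subject="mathematics"):
    """Wrap a test item in a grade-enforcement instruction before querying."""
    prompt = GRADE_PROMPT.format(grade=grade, subject=subject, item=item_text)
    return query_llm(prompt)
```

The paper's finding is that wrappers of this kind do shift performance, but whether the shifted model then matches the average student at the targeted grade varies by model, subject, and grade.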
🔎 Similar Papers
No similar papers found.