Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reading comprehension item difficulty assessment relies on labor-intensive manual annotation and large-scale administration, limiting scalability. Method: This paper pioneers a systematic investigation into the feasibility of leveraging large language models (LLMs)—specifically GPT-4o and o1—to automatically estimate item difficulty. Using the SARA dataset, we align LLM-generated difficulty scores with psychometric item parameters derived from Item Response Theory (IRT) and propose a joint evaluation framework integrating question answering performance and difficulty classification. Contribution/Results: LLM-based difficulty estimates exhibit statistically significant alignment with IRT parameters (p < 0.01) and demonstrate sensitivity to extreme-difficulty items. Estimates show cross-item-type stability and educational validity. This work establishes an LLM-driven paradigm for lightweight, scalable, and dynamic item difficulty assessment—enabling efficient, annotation-free difficulty calibration for adaptive learning systems.

📝 Abstract
Reading comprehension is key to individual success, yet assessing question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis and Item Response Theory (IRT). While these robust approaches provide valuable insights, their scalability is limited. Large Language Models (LLMs) have the potential to automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension and paving the way for more adaptive and personalized educational assessments.
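As a rough illustration of the kind of alignment the abstract describes, the sketch below correlates LLM-style difficulty ratings with IRT difficulty (b) parameters using Spearman rank correlation. The item values, rating scale, and helper functions are invented for illustration only; they do not reproduce the SARA data or the paper's exact evaluation procedure.

```python
# Minimal sketch: rank-correlate hypothetical LLM difficulty ratings
# against IRT b-parameters. All item values below are made up.

def rank(values):
    """Return average (1-based) ranks for a list of numbers, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation on ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-item data: 1-5 difficulty ratings from an LLM,
# and IRT b-parameters fitted from (imaginary) test-taker responses.
llm_ratings = [1, 2, 2, 3, 4, 5, 3, 4]
irt_b = [-1.8, -0.9, -1.1, 0.1, 0.8, 2.0, 0.3, 1.2]
print(round(spearman(llm_ratings, irt_b), 3))
```

A correlation near 1 would indicate that the model's ordinal difficulty judgments track the psychometric difficulty ordering, which is the kind of alignment the paper tests against IRT-derived parameters.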
Problem

Research questions and friction points this paper is trying to address.

Automate reading comprehension difficulty estimation
Evaluate LLMs for question difficulty classification
Bridge traditional psychometrics with modern AIS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Automated difficulty assessment
Adaptive Instructional Systems
Yoshee Jain
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
John Hollander
Arkansas State University
Psycholinguistics · embodied cognition · reading · literacy skills
Amber He
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Sunny Tang
Heinz College of Information Systems and Public Policy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Liang Zhang
Department of Electrical and Computer Engineering, University of Memphis, Memphis, TN 38152, USA; Institute for Intelligent Systems, University of Memphis, Memphis, TN 38152, USA
John Sabatini
Institute for Intelligent Systems, University of Memphis
Reading · literacy · assessment