🤖 AI Summary
Traditional reading comprehension item difficulty assessment relies on labor-intensive manual annotation and large-scale test administration, which limits scalability. Method: This paper pioneers a systematic investigation into whether large language models (LLMs), specifically GPT-4o and o1, can automatically estimate item difficulty. Using the SARA dataset, we align LLM-generated difficulty scores with psychometric item parameters derived from Item Response Theory (IRT) and propose a joint evaluation framework that integrates question-answering performance with difficulty classification. Contribution/Results: LLM-based difficulty estimates align significantly with IRT parameters (p < 0.01), are sensitive to extreme-difficulty items, and remain stable across item types while retaining educational validity. This work establishes an LLM-driven paradigm for lightweight, scalable, and dynamic item difficulty assessment, enabling efficient, annotation-free difficulty calibration for adaptive learning systems.
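As a rough illustration of the alignment step described above, the sketch below compares hypothetical LLM-generated difficulty scores against IRT difficulty (b) parameters for the same items using a rank correlation. The variable names and toy values are assumptions for illustration only, not the paper's data or pipeline.

```python
# Minimal sketch (assumed, not the paper's code): check whether LLM-estimated
# item difficulty preserves the ordering implied by IRT difficulty parameters.
from scipy.stats import spearmanr

# Hypothetical values: one entry per reading-comprehension item.
llm_difficulty = [0.20, 0.55, 0.35, 0.80, 0.95, 0.10]   # LLM difficulty ratings (e.g., scaled 0-1)
irt_b_params   = [-1.2,  0.1, -0.4,  0.9,  1.6, -1.8]   # IRT difficulty (b) parameters

# Spearman rank correlation tests monotonic agreement between the two scales.
rho, p_value = spearmanr(llm_difficulty, irt_b_params)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```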
📝 Abstract
Reading comprehension is a key skill for individual success, yet assessing question difficulty remains challenging because traditional methods such as linguistic analysis and Item Response Theory (IRT) require extensive human annotation and large-scale testing. While these robust approaches provide valuable insights, their scalability is limited. Large Language Models (LLMs) have the potential to automate question difficulty estimation, but this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions from the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the models' accuracy in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with IRT-derived parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS). By bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension, this approach paves the way for more adaptive and personalized educational assessments.
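For readers unfamiliar with the IRT parameters referenced above, a common formulation is the two-parameter logistic model; the abstract does not state which IRT model the paper fits, so this is shown only as a representative example. The probability that a learner with ability $\theta$ answers item $i$ correctly is

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where $b_i$ is the item difficulty parameter against which LLM-generated estimates would be compared, and $a_i$ is the item discrimination.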