🤖 AI Summary
Traditional item difficulty estimation relies on real student response data to fit Item Response Theory (IRT) models, incurring high data collection costs and failing to address the cold-start problem for newly introduced open-ended items.
Method: This paper proposes SMART, the first framework enabling cold-start difficulty prediction for open-ended items without requiring real responses. SMART leverages large language models (LLMs) to synthesize controllable, IRT-aligned artificial students; calibrates their ability distribution via Direct Preference Optimization (DPO); and infers item difficulty by generating synthetic responses and fitting IRT models thereto.
Contribution/Results: Experiments on real student datasets demonstrate that SMART significantly outperforms existing methods across prediction accuracy, generalizability, and scalability. It establishes a novel, efficient, and robust paradigm for item difficulty estimation—enabling scalable personalized learning and psychometric assessment without reliance on empirical response data.
📝 Abstract
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.