🤖 AI Summary
Estimating the difficulty of multiple-choice questions (MCQs) in educational settings is costly when done by humans, and existing automated methods lack robustness. Method: This paper proposes a paradigm that leverages the intrinsic uncertainty of large language models (LLMs): it quantifies confidence fluctuations and answer disagreements—measured via entropy, variance, and consistency—across multi-LLM collaborative reasoning as proxy signals for question difficulty, and fuses them with question-stem text embeddings to train a random forest regression model. Contribution/Results: It is the first work to systematically model LLM cognitive uncertainty as an interpretable, annotation-free difficulty indicator, bypassing reliance on human labels or shallow textual features. Evaluated on the USMLE and CMCQRD datasets, it achieves state-of-the-art performance; uncertainty-aware features significantly improve prediction accuracy, and estimated difficulty correlates strongly and inversely with empirical student pass rates (r < −0.85).
📝 Abstract
In an educational setting, an estimate of the difficulty of multiple-choice questions (MCQs), a commonly used strategy to assess learning progress, constitutes very useful information for both teachers and students. Since human assessment is costly from multiple points of view, automatic approaches to MCQ item difficulty estimation have been investigated, though with mixed success so far. Our approach to this problem takes a different angle from previous work: asking various Large Language Models to tackle the questions included in three different MCQ datasets, we leverage model uncertainty to estimate item difficulty. Using both model uncertainty features and textual features in a Random Forest regressor, we show that uncertainty features contribute substantially to difficulty prediction, where difficulty is inversely proportional to the number of students who can correctly answer a question. In addition to demonstrating the value of our approach, we observe that our model achieves state-of-the-art results on the publicly available USMLE and CMCQRD datasets.
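The pipeline described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the uncertainty features (entropy of the pooled answer distribution, cross-model variance, and majority-vote consistency) follow the summary's description, while the simulated per-model answer probabilities, the random text embeddings, and the difficulty labels are stand-ins for real LLM outputs, question-stem embeddings, and annotated data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def uncertainty_features(answer_probs):
    """Summarize disagreement among models on one MCQ.

    answer_probs: (n_models, n_options) array of each model's
    probability distribution over the answer options.
    """
    p = np.asarray(answer_probs, dtype=float)
    mean_p = p.mean(axis=0)
    # Entropy of the pooled answer distribution (higher = more uncertain).
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12))
    # Average per-option variance across models (higher = more disagreement).
    variance = p.var(axis=0).mean()
    # Fraction of models agreeing with the majority answer.
    picks = p.argmax(axis=1)
    consistency = np.mean(picks == np.bincount(picks).argmax())
    return np.array([entropy, variance, consistency])

# Toy data: 20 questions, 3 "models", 4 answer options each.
rng = np.random.default_rng(0)
rows = []
for _ in range(20):
    probs = rng.dirichlet(np.ones(4), size=3)   # stand-in LLM answer distributions
    text_emb = rng.normal(size=8)               # stand-in question-stem embedding
    rows.append(np.concatenate([uncertainty_features(probs), text_emb]))
X = np.array(rows)
y = rng.uniform(size=20)                        # stand-in difficulty labels

# Fuse uncertainty and textual features in a Random Forest regressor.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = model.predict(X[:1])
```

With real data, `answer_probs` would come from repeated LLM queries per question, `text_emb` from a sentence encoder applied to the question stem, and `y` from empirical pass rates.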