Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?

📅 2024-07-07
🏛️ International Conference on Computational Linguistics
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether generative uncertainty in large language models (LLMs) serves as a valid proxy for item difficulty in multiple-choice question answering. Using 451 authentic exam items from a biopsychology course, we systematically analyze the relationship between LLM response uncertainty—quantified via logit entropy and sampling variance—and human student performance metrics (e.g., accuracy, option selection frequencies), as well as fine-grained item types (e.g., conceptual discrimination, inferential application). Our key contribution is the first empirical demonstration of systematic, statistically significant differences in model uncertainty between correct and incorrect options. We further find a weak but robust negative correlation between uncertainty and item difficulty, with correlation strength varying across cognitive item types. These results establish a novel, annotation-free, and empirically grounded framework for automated item difficulty estimation, offering both theoretical insight and practical methodology for educational assessment and AI-driven test design.
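The summary names two uncertainty signals: entropy over option logits and variance across sampled answers. As a minimal sketch (not the authors' code, and not tied to any specific model API), the snippet below shows one plausible way to compute both metrics for a single multiple-choice item; the function names, the logit values, and the interpretation of "sampling variance" as disagreement with the modal sampled answer are illustrative assumptions.

```python
# Sketch of the two uncertainty metrics named in the summary, for one item.
# Assumes per-option logits and repeated sampled answers are already available.
import math
from collections import Counter

def logit_entropy(option_logits):
    """Shannon entropy of the softmax distribution over the answer options."""
    m = max(option_logits)
    exps = [math.exp(x - m) for x in option_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def sampling_disagreement(sampled_answers):
    """Share of sampled answers that disagree with the modal (most frequent) answer."""
    counts = Counter(sampled_answers)
    modal_freq = counts.most_common(1)[0][1] / len(sampled_answers)
    return 1.0 - modal_freq

# Hypothetical values: logits for options A-D, and 10 sampled answers for the same item.
print(logit_entropy([2.3, 0.1, -0.5, 0.4]))   # higher entropy = more uncertain
print(sampling_disagreement(list("AABAAACAAA")))  # 0.2 -> 20% of samples disagree with the mode
```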

📝 Abstract
Estimating the difficulty of multiple-choice questions would be of great help to educators, who must spend substantial time creating and piloting stimuli for their tests, and to learners who want to practice. Supervised approaches to difficulty estimation have to date yielded mixed results. In this contribution we leverage an aspect of large generative models that might be seen as a weakness when answering questions, namely their uncertainty, and exploit it by exploring correlations between two different metrics of uncertainty and the actual student response distribution. While we observe correlations that are present but weak, we also find that the models' behaviour differs between correct and wrong answers, and that correlations vary substantially across the question types included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues for further leveraging model uncertainty as an additional proxy for item difficulty.
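The abstract describes correlating per-item model uncertainty with the student response distribution, split by whether the model answered correctly. The sketch below is an illustrative analysis under assumed data (not the paper's code or results): it takes difficulty as 1 minus the proportion of students answering correctly and reports Spearman correlations overall and per split; the item values are made up.

```python
# Illustrative correlation analysis with fabricated example values.
from scipy.stats import spearmanr

items = [
    # (model_uncertainty, student_accuracy, model_answered_correctly)
    (0.21, 0.88, True),
    (0.95, 0.42, False),
    (0.40, 0.75, True),
    (1.10, 0.55, False),
    (0.33, 0.80, True),
    (0.70, 0.60, False),
]

uncertainty = [u for u, _, _ in items]
difficulty = [1.0 - acc for _, acc, _ in items]  # harder item = fewer students correct
rho, p = spearmanr(uncertainty, difficulty)
print(f"overall: rho={rho:.2f}, p={p:.3f}")

# Separate correlations for items the model got right vs. wrong.
for flag in (True, False):
    sub = [(u, 1.0 - acc) for u, acc, ok in items if ok is flag]
    rho, p = spearmanr([u for u, _ in sub], [d for _, d in sub])
    print(f"model_correct={flag}: rho={rho:.2f}, p={p:.3f}")
```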
Problem

Research questions and friction points this paper is trying to address.

Model Uncertainty
Multiple Choice Difficulty
Educational Impact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Uncertainty
Educational Assessment
Biopsychology Question Set
🔎 Similar Papers
No similar papers found.