Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

📅 2026-05-12

🤖 AI Summary
This work addresses the limitations of existing methods for assessing question difficulty in question answering, which rely on readability metrics, retrieval signals, or popularity and fail to capture the intrinsic reasoning challenges of questions. The authors propose Q-DAPS, which defines question difficulty as the entropy of plausibility scores assigned to candidate answers. The resulting metric is interpretable, scalable, and bias-resilient, and it aligns closely with human judgments. Evaluated on four benchmark datasets (TriviaQA, Natural Questions (NQ), MuSiQue, and QASC), Q-DAPS consistently outperforms the baselines. It also remains robust across model scales, hyperparameter settings, and question types, which supports its generalizability and practical utility for evaluating QA systems.
📝 Abstract
Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores), a novel method that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS on four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC) and show that it consistently outperforms baselines. Q-DAPS is also robust to hyperparameter variations and across question types. Extensive ablation studies show that it remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
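The abstract describes the core idea as the entropy of plausibility scores over candidate answers but does not give the formula. The following is a minimal sketch of one possible reading, not the authors' implementation. It assumes the scores are non-negative, that they are normalized to sum to one, that Shannon entropy in bits is used, and that higher entropy indicates a harder question. The function name `plausibility_entropy` and the example scores are hypothetical.

```python
import math
from typing import Sequence


def plausibility_entropy(scores: Sequence[float]) -> float:
    """Shannon entropy (in bits) of plausibility scores.

    Assumption: scores are non-negative and are normalized here so
    that they sum to one. The source does not specify this step.
    """
    if not scores:
        raise ValueError("need at least one candidate answer")
    if any(s < 0 for s in scores):
        raise ValueError("plausibility scores must be non-negative")
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")
    probs = [s / total for s in scores]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical scores: one clearly dominant candidate answer.
peaked = plausibility_entropy([0.9, 0.05, 0.05])
# Hypothetical scores: four equally plausible candidate answers.
flat = plausibility_entropy([0.25, 0.25, 0.25, 0.25])
print(f"peaked: {peaked:.3f} bits, flat: {flat:.3f} bits")
```

Under these assumptions, a question whose candidates are all equally plausible gets the maximum entropy for its candidate count, while a question with one dominant candidate scores close to zero.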
Problem

Research questions and friction points this paper is trying to address.

question difficulty estimation
large language models
answer plausibility
question answering
reasoning challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

question difficulty estimation
answer plausibility scoring
entropy-based metric
large language models
robust evaluation