🤖 AI Summary
This work addresses the limitations of existing methods for estimating question difficulty in question answering, which rely on readability metrics, retrieval signals, or popularity statistics and may not capture the reasoning challenges a question poses. The authors propose Q-DAPS, which estimates question difficulty as the entropy of plausibility scores assigned to candidate answers, yielding an interpretable, scalable, and bias-resilient metric that aligns closely with human judgments. Evaluated on four benchmark datasets, namely TriviaQA, Natural Questions (NQ), MuSiQue, and QASC, Q-DAPS consistently outperforms baselines. It also remains robust across model sizes, hyperparameter settings, and question types, which supports its generalizability and practical utility for evaluating QA systems.
📝 Abstract
Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores), a novel method that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS on four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC) and demonstrate that it consistently outperforms baselines. Q-DAPS is also robust to hyperparameter variations and across question types. Extensive ablation studies show that it remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
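To make the core idea concrete, the following is a minimal sketch of computing an entropy over plausibility scores for a question's candidate answers. It is an illustration under stated assumptions, not the authors' implementation: the function name `plausibility_entropy`, the sum-to-one normalization of raw scores, the base-2 logarithm, and the optional division by the maximum entropy are all choices made here for illustration. The abstract does not specify how scores are obtained or normalized, or how the entropy value is mapped to a difficulty level.

```python
import math


def plausibility_entropy(scores, normalize=True):
    """Shannon entropy of plausibility scores over candidate answers.

    Assumptions (not taken from the paper): scores are non-negative,
    they are turned into a distribution by dividing by their sum, and
    the result is optionally scaled to [0, 1] by the maximum entropy.
    """
    if not scores:
        raise ValueError("need at least one candidate answer score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")

    # Convert raw plausibility scores into a probability distribution.
    probs = [s / total for s in scores]

    # Shannon entropy in bits; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)

    # Optionally scale by log2(n) so questions with different numbers
    # of candidates are comparable.
    if normalize and len(scores) > 1:
        entropy /= math.log2(len(scores))
    return entropy


# Hypothetical scores: one dominant candidate vs. several similar ones.
peaked = plausibility_entropy([9, 1, 0, 0])
spread = plausibility_entropy([3, 3, 2, 2])
print(f"peaked: {peaked:.3f}, spread: {spread:.3f}")
```

In this sketch, scores concentrated on a single candidate give an entropy near 0, while evenly spread scores give a normalized entropy near 1.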