🤖 AI Summary
Large language models (LLMs) exhibit poor calibration and systematic failure in multiple-choice questions containing a “None of the Above” (NA) option, revealing fundamental deficits in meta-cognitive rejection and uncertainty awareness.
Method: We conduct controlled experiments across 28 LLMs on the MMLU benchmark, integrating confidence-score analysis with cross-scale and cross-disciplinary attribution to isolate NA-specific performance degradation.
Contribution/Results: We identify, for the first time, a catastrophic 30–50% average accuracy drop when NA is the correct answer, demonstrating a severe, domain-dependent deficit in option-negation capability. The decline varies markedly by discipline: only −14.6% in mathematical reasoning versus −48.1% in business ethics, where uncertainty recognition is critical. These findings expose a core limitation in LLMs’ ability to withhold responses under epistemic uncertainty. Our work establishes a novel evaluation paradigm for uncertainty-aware reasoning and provides key empirical evidence for assessing meta-cognitive refusal capacity in foundation models.
📝 Abstract
Multiple-choice exam questions with "None of the above" (NA) options have been extensively studied in educational testing, where existing research suggests that they better assess true knowledge. However, their impact on the evaluation of Large Language Models (LLMs) remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that when NA is the correct answer, models suffer a consistent 30-50% performance drop regardless of scale, suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling, such as business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs' ability to handle uncertainty in real-world applications.
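To make the evaluation setup concrete, the NA condition described above can be sketched as a simple transformation of a standard multiple-choice item: the correct option is removed and "None of the above" is appended, so that NA becomes the gold answer. This is a minimal illustrative sketch; the function name `make_na_variant` and the dictionary layout are assumptions, not the paper's actual pipeline.

```python
def make_na_variant(question, options, answer_idx):
    """Build the NA condition for one multiple-choice item (illustrative sketch).

    The original correct option is dropped, the remaining distractors are kept
    in order, and 'None of the above' is appended as the last option, which
    then becomes the gold answer for the modified item.
    """
    distractors = [opt for i, opt in enumerate(options) if i != answer_idx]
    new_options = distractors + ["None of the above"]
    return {
        "question": question,
        "options": new_options,
        "answer_idx": len(new_options) - 1,  # NA is now the correct choice
    }


# Example: a toy MMLU-style item whose original correct answer is "4".
item = make_na_variant("What is 2 + 2?", ["3", "4", "5", "6"], answer_idx=1)
print(item["options"])     # ['3', '5', '6', 'None of the above']
print(item["answer_idx"])  # 3
```

Under this construction, a model can only answer correctly by rejecting every substantive option, which is exactly the meta-cognitive rejection behavior the abstract argues current LLMs lack.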