None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit poor calibration and systematic failure in multiple-choice questions containing a “None of the Above” (NA) option, revealing fundamental deficits in meta-cognitive rejection and uncertainty awareness. Method: We conduct controlled experiments across 28 LLMs on the MMLU benchmark, integrating confidence-score analysis with cross-scale and cross-disciplinary attribution to isolate NA-specific performance degradation. Contribution/Results: We identify, for the first time, a catastrophic 30–50% average accuracy drop when NA is the correct answer—demonstrating a severe, domain-dependent deficit in option negation capability. Performance decline varies markedly by discipline: only −14.6% in mathematical reasoning versus −48.1% in business ethics, where uncertainty recognition is critical. These findings expose a core limitation in LLMs’ ability to withhold responses under epistemic uncertainty. Our work establishes a novel evaluation paradigm for uncertainty-aware reasoning and provides key empirical evidence for assessing meta-cognitive refusal capacity in foundation models.

📝 Abstract
Multiple-choice exam questions with "None of the above" (NA) options have been extensively studied in educational testing, where existing research suggests that they better assess true knowledge. However, their impact on Large Language Model (LLM) evaluation remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50% performance drop across models regardless of scale, suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling like business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs' ability to handle uncertainty in real-world applications.
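The headline metric can be reproduced from per-item predictions: split the evaluation set by whether NA is the gold answer and compare accuracies. A minimal sketch (the data format and `na_letter` placement are assumptions for illustration, not the paper's actual code):

```python
# Sketch of the NA-drop metric: accuracy on items where "None of the above"
# (NA) is the gold answer vs. items where it is not, and the difference.

def accuracy(items):
    """items: list of (predicted_letter, gold_letter) pairs."""
    if not items:
        return 0.0
    return sum(p == g for p, g in items) / len(items)

def na_performance_drop(results, na_letter="E"):
    """results: list of (predicted, gold) pairs; assume NA occupies option `na_letter`."""
    na_items = [(p, g) for p, g in results if g == na_letter]
    other_items = [(p, g) for p, g in results if g != na_letter]
    return accuracy(other_items) - accuracy(na_items)

# Toy example: the model answers 3/4 regular items correctly but 0/2 NA items.
results = [("A", "A"), ("B", "B"), ("C", "C"), ("A", "D"),
           ("B", "E"), ("C", "E")]
print(na_performance_drop(results))  # 0.75 - 0.0 = 0.75
```

The paper reports this drop per model and per MMLU discipline, which is how the 14.6% (mathematical reasoning) vs. 48.1% (business ethics) contrast arises.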
Problem

Research questions and friction points this paper is trying to address.

Impact of 'None of the above' options on LLM performance.
LLMs' inability to reject all options when none are correct.
Domain-dependent performance drop in tasks requiring uncertainty handling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes performance of 28 LLMs on MMLU questions with NA options.
Reveals a consistent 30-50% performance drop when NA is the correct answer.
Highlights domain-dependent deficits in uncertainty handling.