Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

📅 2026-04-16

🤖 AI Summary
Conventional multiple-choice evaluations offer only a handful of answer options, which leaves them open to shortcut strategies and limits how accurately they measure the capabilities of large language models. This work proposes a massive-option evaluation protocol that scales the candidate set to one hundred options, densely populating distractors in a Korean orthographic error detection task. The protocol uses fixed targets with repeated resampling and shuffling of options, together with padding-controlled and length-matched tests, to separate content-driven failures from positional artifacts. The approach exposes deficiencies that low-option settings obscure and identifies two failure modes: semantic confusion and, under uncertainty, a bias toward early-positioned options. Performance that looks strong with few options often weakens under high distractor density, and the controlled tests suggest that the main bottleneck is candidate ranking rather than context length.
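One way to tell the two failure modes apart is to log, for every trial, where the target sat and which position the model chose, then check whether wrong picks cluster near the top of the list. The sketch below is illustrative only and is not the paper's analysis code: the record format, the function name `summarize_errors`, and the `early_k` cutoff are assumptions made for this example.

```python
def summarize_errors(records, n_options, early_k=10):
    """Summarize accuracy and how often wrong picks land in early positions.

    records   : iterable of (target_index, chosen_index) pairs, 0-based
                (hypothetical logging format, assumed for this sketch)
    n_options : number of candidates shown per trial
    early_k   : how many leading positions count as "early" (assumed cutoff)
    """
    records = list(records)
    errors = [(t, c) for t, c in records if t != c]
    early_errors = sum(1 for _, c in errors if c < early_k)
    return {
        "accuracy": (len(records) - len(errors)) / len(records) if records else 0.0,
        # Share of wrong picks that fall in the first early_k positions.
        "early_error_share": early_errors / len(errors) if errors else 0.0,
        # Rough reference value if wrong picks were spread evenly over positions.
        "early_share_if_uniform": early_k / n_options,
    }


# Four toy trials with the target at position 50 of 100.
summary = summarize_errors([(50, 50), (50, 0), (50, 3), (50, 70)], n_options=100)
```

An `early_error_share` well above `early_share_if_uniform` would point toward a positional artifact. Errors that follow particular distractor sentences regardless of where shuffling places them would point toward semantic confusion instead.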

📝 Abstract
Multiple-choice evaluation is widely used for benchmarking large language models, yet near-ceiling accuracy in low-option settings can be sustained by shortcut strategies that obscure true competence. We therefore propose a massive-option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task in which models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content-driven failures from positional artifacts. Across experiments, results indicate that strong performance in low-option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes: semantic confusion, and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding-controlled and length-matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive-option evaluation as a general framework for stress-testing model reliability under extreme distractor density, beyond what low-option benchmarks can reveal.
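The fixed-target, resample-and-shuffle idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trial` type, the function names, the pool of correct sentences, and the seeding scheme are all hypothetical, and the target is assumed not to appear in the distractor pool.

```python
import random
from dataclasses import dataclass


@dataclass
class Trial:
    options: list[str]   # shuffled candidate sentences shown to the model
    target_index: int    # 0-based position of the single incorrect sentence


def build_trials(target, distractor_pool, n_options, n_repeats, seed=0):
    """Keep one target fixed; redraw and reshuffle distractors on every repeat."""
    if n_options - 1 > len(distractor_pool):
        raise ValueError("distractor pool is too small for the requested n_options")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    trials = []
    for _ in range(n_repeats):
        # Fresh sample of N-1 distractors, drawn without replacement.
        options = rng.sample(distractor_pool, n_options - 1) + [target]
        rng.shuffle(options)  # randomize where the target lands
        trials.append(Trial(options, options.index(target)))
    return trials


def chance_accuracy(n_options):
    """Expected accuracy of uniform random guessing with one correct answer."""
    return 1.0 / n_options


# Placeholder strings stand in for real sentences.
pool = [f"correct sentence {i}" for i in range(200)]
trials = build_trials("sentence with an error", pool, n_options=100, n_repeats=20)
```

Averaging a model's accuracy over the repeats for a fixed target gives a per-target estimate that does not depend on one particular draw of distractors or one particular ordering. Random guessing falls from `chance_accuracy(4)`, which is 0.25, to `chance_accuracy(100)`, which is 0.01.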
Problem

Research questions and friction points this paper is trying to address.

multiple choice evaluation
large language models
distractor density
model competence
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

massive-option evaluation
distractor density
position bias
semantic confusion
candidate ranking