WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multiple-choice benchmarks are often too easy to discriminate among models' true capabilities. To address this, the authors propose WiCkeD, a method that raises difficulty by randomly replacing one answer option with "None of the Above" (NOTA), a device common in educational testing. The perturbation is agnostic to benchmark design and can be applied automatically to any multiple-choice evaluation suite, enabling scalable assessment of open-weight LLMs. Applied to six popular benchmarks and 18 models, WiCkeD induces an average accuracy drop of 12.1 points; even with chain-of-thought (CoT) prompting the drop is comparable, showing that the added reasoning burden is not trivially overcome. Because some models are more sensitive than others to this extra reasoning step, WiCkeD offers finer-grained capability analysis than aggregate accuracy alone. All code and data are publicly released.

📝 Abstract
We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Existing multiple-choice benchmarks are too easy to discriminate among strong models
Benchmarks need a cheap, automatic way to become more challenging
Aggregate accuracy hides how sensitive individual models are to extra reasoning demands
Innovation

Methods, ideas, or system contributions that make the work stand out.

Randomly replaces one answer option with "None of the above" (NOTA)
Applies automatically to any existing multiple-choice benchmark
Measures each model's sensitivity to the extra reasoning NOTA requires
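The core perturbation can be sketched in a few lines. This is a minimal illustration of the idea described in the abstract, not the authors' released implementation (see their repository for that); the function name `wicked` and its signature are hypothetical.

```python
import random

NOTA = "None of the above"

def wicked(choices, answer_idx, rng=random):
    """Replace one randomly chosen option with "None of the above".

    If the replaced option happens to be the gold answer, NOTA becomes
    the correct choice (its index is the same position i); otherwise the
    original answer keeps its index. How the paper positions NOTA among
    the options is an assumption here: we keep it in the replaced slot.
    """
    choices = list(choices)              # don't mutate the caller's list
    i = rng.randrange(len(choices))      # option to replace
    choices[i] = NOTA
    # The correct index is unchanged either way: if i == answer_idx,
    # the gold text is gone and NOTA (now at i) is the right answer.
    return choices, answer_idx
```

Note that when a distractor is replaced, the model must still verify that the remaining options do not all fail, which is the extra reasoning step the paper argues makes the benchmark harder.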