Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often achieve high scores on multiple-choice question answering (MCQA) benchmarks despite exhibiting low response consistency, i.e., unstable predictions under semantically preserving answer-option perturbations, which undermines evaluation reliability. Method: This paper proposes Consistency-Rebalanced Accuracy (CoRA), an evaluation metric that integrates response consistency into MCQA assessment. CoRA synthesizes perturbed variants of the original questions by reordering or relabeling answer options, then computes two intermediate metrics, Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), to recalibrate raw MCQA accuracy. Contribution/Results: Building on synthetic question generation, option reconstruction, and consistency analysis, CoRA is empirically validated across diverse LLMs and MCQA benchmarks. Experiments reveal that several high-scoring LLMs suffer from markedly low consistency; CoRA identifies and downweights such models, improving evaluation reliability and discriminative power over conventional accuracy.
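
To make the perturbation step concrete, here is a minimal sketch of how semantically preserving variants can be generated. `perturb_question` is a hypothetical helper, not the authors' code, and it assumes options are given as a label-to-text mapping:

```python
import random

def perturb_question(stem, options, seed=0):
    """Create a perturbed MCQ variant by shuffling option order.

    `stem` is the question text; `options` maps labels (e.g. "A") to
    answer texts. Labels are reassigned to the shuffled order, so the
    correct answer's text is preserved while its position and label change.
    """
    rng = random.Random(seed)  # fixed seed for reproducible perturbations
    texts = list(options.values())
    rng.shuffle(texts)
    labels = [chr(ord("A") + i) for i in range(len(texts))]
    return stem, dict(zip(labels, texts))

# Example: "Mars" remains among the options, but may now sit under a new label.
stem = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}
print(perturb_question(stem, options, seed=1))
```

A consistent model should keep selecting the same answer text across such variants, regardless of which label it carries.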

📝 Abstract
In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple-choice (MC) benchmarks. Our metric probes the response consistency of LLMs by taking advantage of synthetically generated questions with altered answer choices. Using two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the LLM's level of consistency. We present evaluations on different benchmarks using diverse LLMs, and demonstrate not only that LLMs can exhibit low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
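
The abstract does not spell out the formulas, so the following sketch uses assumed, illustrative definitions: BMCA as the fraction of questions answered correctly on every variant, CI as the average agreement with the modal answer across variants, and CoRA as a CI-weighted interpolation between raw accuracy and BMCA. None of these are the paper's exact equations:

```python
from statistics import mean

def bmca(results):
    """Assumed BMCA: fraction of questions answered correctly on
    *every* perturbed variant (the bare minimum for consistency)."""
    return mean(all(r) for r in results)

def consistency_index(answers):
    """Assumed CI: per question, the fraction of variants agreeing
    with the modal answer (label-normalized), averaged over questions."""
    def per_q(a):
        return max(a.count(x) for x in set(a)) / len(a)
    return mean(per_q(a) for a in answers)

def cora(accuracy, bmca_score, ci):
    """Assumed CoRA combination: interpolate between raw accuracy and
    BMCA by CI, so inconsistent models are pulled toward BMCA."""
    return ci * accuracy + (1.0 - ci) * bmca_score

# Toy data: 3 questions x 4 variants. results[i][j] is True when the
# model answered variant j of question i correctly; answers holds the
# selected answer, normalized back to the original option identity.
results = [[True, True, True, True],
           [True, False, True, True],
           [False, False, True, False]]
answers = [["B", "B", "B", "B"],
           ["A", "C", "A", "A"],
           ["D", "B", "B", "A"]]
acc = mean(mean(r) for r in results)  # raw accuracy averaged over variants
print(cora(acc, bmca(results), consistency_index(answers)))
```

Under these placeholder definitions, a model that flips its answer across variants sees its score pulled down toward BMCA even when its raw accuracy is high, which matches the behavior the paper reports for CoRA.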
Problem

Research questions and friction points this paper is trying to address.

Improving reliability of LLM scores on multiple-choice benchmarks
Addressing low response consistency despite high MCQA scores
Designing a metric that adjusts MCQA scores to reflect response consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Consistency-Rebalanced Accuracy (CoRA) metric
Uses synthetic questions with altered answer choices
Adjusts scores based on consistency via BMCA and CI (toy comparison below)
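
As a follow-up to the sketch above, a toy comparison using the same assumed CoRA combination shows how an inconsistent high scorer can be scaled down below a consistent, slightly lower-scoring model. All numbers are invented for illustration:

```python
def cora(accuracy, bmca_score, ci):
    # Assumed combination (not the paper's exact formula): CI-weighted
    # interpolation between raw accuracy and BMCA.
    return ci * accuracy + (1.0 - ci) * bmca_score

print(cora(0.85, 0.40, 0.55))  # high accuracy, low consistency   -> ~0.65
print(cora(0.80, 0.72, 0.95))  # lower accuracy, high consistency -> ~0.80
```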