🤖 AI Summary
Answer consistency—i.e., the stability of outputs across multiple stochastic samplings for the same multiple-choice question—remains underexplored for small language models (SLMs, 2B–8B parameters), despite its critical implications for reliability in resource-constrained deployment.
Method: We conduct a systematic evaluation on MMLU-Redux and MedQA, performing 10 repeated inference trials per question to quantify consistency under varying temperature settings, model scales (comparing 2B–8B SLMs against 50B–80B medium-sized models), and fine-tuning status. We also introduce a new framework for analyzing and visualizing consistency.
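The repeated-trial protocol can be sketched in a few lines. This is a minimal illustration, not the paper's code: strict all-trials agreement is assumed as the consistency criterion, and the toy answer lists stand in for actual sampled model outputs.

```python
def is_consistent(trial_answers):
    # Strict criterion: every sampled trial returned the same choice.
    # (Looser criteria, e.g. majority thresholds, are also possible.)
    return len(set(trial_answers)) == 1

def consistency_rate(per_question_trials):
    # Fraction of questions whose repeated trials all agree.
    consistent = sum(is_consistent(t) for t in per_question_trials)
    return consistent / len(per_question_trials)

# Toy example: 3 questions, 10 sampled answers each.
trials = [
    ["B"] * 10,             # fully consistent
    ["A"] * 7 + ["C"] * 3,  # inconsistent
    ["D"] * 10,             # fully consistent
]
print(consistency_rate(trials))  # 2 of 3 questions are consistent
```

The 50%–80% consistency figures reported below correspond to this kind of per-question rate, computed per model and temperature setting.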
Contribution/Results: We find SLMs achieve consistency on only 50%–80% of questions, markedly lower than medium-sized models. Crucially, the accuracy of consistent answers correlates well with overall model accuracy, providing empirical support for using consistency as a confidence-aware filtering criterion. This work establishes answer consistency as a fundamental dimension of SLM reliability and offers actionable guidance on accuracy–consistency trade-offs in lightweight deployment.
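The confidence-aware filtering idea can be illustrated with a short sketch. The data layout and the majority-vote scoring rule here are assumptions for illustration, not the paper's exact protocol; `pearson` is a plain Pearson correlation over per-model accuracy pairs.

```python
from collections import Counter

def accuracies(records):
    """records: list of (trial_answers, gold_choice) pairs, one per question.
    Returns (overall_accuracy, accuracy_on_consistent_subset).
    Overall accuracy scores each question by majority vote over its
    trials (an assumption; per-trial scoring is another option)."""
    n_correct = n_cons = n_cons_correct = 0
    for answers, gold in records:
        majority = Counter(answers).most_common(1)[0][0]
        n_correct += majority == gold
        if len(set(answers)) == 1:  # consistent question
            n_cons += 1
            n_cons_correct += answers[0] == gold
    overall = n_correct / len(records)
    consistent = n_cons_correct / n_cons if n_cons else float("nan")
    return overall, consistent

def pearson(xs, ys):
    # Pearson correlation, e.g. across models: consistent-subset
    # accuracy (xs) vs. overall accuracy (ys).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

Computing the two accuracies per model and correlating them across models yields the kind of consistency–accuracy relationship the results describe; a deployment could then answer only questions in the consistent subset, trading coverage for certainty.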
📝 Abstract
This work explores the consistency of small LLMs (2B-8B parameters) when answering the same question multiple times. We present a study of well-known open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium-sized models (50B-80B), fine-tuned vs. base models, and other parameters. We also examine the effect of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both. To support these studies, we propose new analytical and graphical tools. Results show that the fraction of questions answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Accuracy among consistent answers also appears to correlate reasonably well with overall accuracy. Results for medium-sized models indicate much higher levels of answer consistency.