The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Answer consistency—i.e., the stability of outputs across multiple stochastic samplings for the same multiple-choice question—remains underexplored for small language models (SLMs, 2B–8B parameters), despite its critical implications for reliability in resource-constrained deployment. Method: We conduct systematic evaluation on MMLU-Redux and MedQA, performing 10 repeated inference trials per question to quantify consistency under varying temperature settings, model scales (comparing 2B–8B SLMs against 50B–80B medium-sized models), and fine-tuning status. We introduce a novel consistency analysis and visualization framework. Contribution/Results: We find SLMs achieve consistency on only 50%–80% of questions—significantly lower than medium-sized models. Crucially, accuracy of consistent answers strongly correlates with overall model accuracy (r > 0.9), providing empirical support for using consistency as a confidence-aware filtering criterion. This work establishes answer consistency as a fundamental dimension for assessing SLM reliability and offers actionable guidance for precision–certainty trade-offs in lightweight deployment.
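The consistency criterion described above (an answer is "consistent" when the repeated stochastic samplings for a question agree) can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual code; the function name, the list-of-lists input format, and the `threshold` parameter (1.0 meaning all 10 trials must match) are assumptions for the sketch.

```python
from collections import Counter

def consistency_rate(trials_per_question, threshold=1.0):
    """Fraction of questions answered consistently across repeated trials.

    trials_per_question: list of lists; each inner list holds the answer
        letter the model produced on each repeated trial for one question.
    threshold: fraction of trials that must agree for the question to
        count as consistent (1.0 = all trials give the same answer).
    """
    consistent = 0
    for answers in trials_per_question:
        # Count how often the most frequent answer appears in the trials.
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / len(answers) >= threshold:
            consistent += 1
    return consistent / len(trials_per_question)
```

For example, a model that answers one question identically in all 10 trials but splits 6/4 on another would score `consistency_rate([["A"] * 10, ["A"] * 6 + ["B"] * 4])` = 0.5 under the strict (all-trials-agree) criterion.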

📝 Abstract
This work explores the consistency of small LLMs (2B-8B parameters) when answering the same question multiple times. We present a study of well-known open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium-sized models (50B-80B), fine-tuned vs. base models, and other parameters. We also examine the effects of requiring multi-trial answer consistency on accuracy, and the trade-offs involved in choosing a model that provides both. To support these studies, we propose new analytical and graphical tools. Results show that the fraction of questions that can be answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Accuracy among consistent answers also appears to correlate reasonably well with overall accuracy. Results for medium-sized models suggest much higher levels of answer consistency.
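The accuracy-consistency trade-off described above (using consistency as a confidence-aware filter) amounts to scoring only the questions that pass the consistency criterion and tracking how many survive the filter. A minimal sketch, assuming the same list-of-lists trial format and a majority-agreement threshold as hypothetical inputs:

```python
from collections import Counter

def filtered_accuracy(trials_per_question, gold, threshold=1.0):
    """Accuracy restricted to consistently answered questions.

    trials_per_question: list of lists of per-trial answer letters.
    gold: list of correct answer letters, aligned with the questions.
    Returns (accuracy_on_consistent, coverage): the accuracy of the
    majority answer over questions that pass the consistency filter,
    and the fraction of questions that pass it.
    """
    correct = kept = 0
    for answers, label in zip(trials_per_question, gold):
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:  # question is "consistent"
            kept += 1
            correct += (top == label)
    if kept == 0:
        return 0.0, 0.0
    return correct / kept, kept / len(trials_per_question)
```

Raising `threshold` trades coverage for certainty: fewer questions are answered, but the accuracy on the remaining ones tracks (and typically exceeds) the model's overall accuracy, which is the precision-certainty trade-off the paper quantifies.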
Problem

Research questions and friction points this paper is trying to address.

Evaluating answer consistency in small LLMs across repeated trials
Analyzing temperature and model size effects on response stability
Investigating accuracy-consistency tradeoffs in multiple-choice benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating small LLM consistency via repetition trials
Proposing new analytical tools for multi-trial assessment
Analyzing temperature impact on answer consistency