🤖 AI Summary
Answer consistency—i.e., the stability of outputs across multiple stochastic samplings for the same multiple-choice question—remains underexplored for small language models (SLMs, 2B–8B parameters), despite its critical implications for reliability in resource-constrained deployment.
Method: We conduct a systematic evaluation on MMLU-Redux and MedQA, performing 10 repeated inference trials per question to quantify consistency under varying temperature settings, model scales (comparing 2B–8B SLMs against 50B–80B medium-sized models), and fine-tuning status. We also introduce a new framework for analyzing and visualizing consistency.
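The repeated-trial protocol can be sketched in a few lines. This is a minimal illustration, not the paper's code: strict all-trials agreement is assumed as the consistency criterion, and the toy answer lists stand in for actual sampled model outputs.

```python
def is_consistent(trial_answers):
    # Strict criterion: every sampled trial returned the same choice.
    # (Looser criteria, e.g. majority thresholds, are also possible.)
    return len(set(trial_answers)) == 1

def consistency_rate(per_question_trials):
    # Fraction of questions whose repeated trials all agree.
    consistent = sum(is_consistent(t) for t in per_question_trials)
    return consistent / len(per_question_trials)

# Toy example: 3 questions, 10 sampled answers each.
trials = [
    ["B"] * 10,             # fully consistent
    ["A"] * 7 + ["C"] * 3,  # inconsistent
    ["D"] * 10,             # fully consistent
]
print(consistency_rate(trials))  # 2 of 3 questions are consistent
```

The 50%–80% consistency figures reported below correspond to this kind of per-question rate, computed per model and temperature setting.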
Contribution/Results: We find SLMs achieve consistency on only 50%–80% of questions, markedly lower than medium-sized models. Crucially, the accuracy of consistent answers correlates well with overall model accuracy, providing empirical support for using consistency as a confidence-aware filtering criterion. This work establishes answer consistency as a fundamental dimension of SLM reliability and offers actionable guidance on accuracy–consistency trade-offs in lightweight deployment.
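The confidence-aware filtering idea can be illustrated with a short sketch. The data layout and the majority-vote scoring rule here are assumptions for illustration, not the paper's exact protocol; `pearson` is a plain Pearson correlation over per-model accuracy pairs.

```python
from collections import Counter

def accuracies(records):
    """records: list of (trial_answers, gold_choice) pairs, one per question.
    Returns (overall_accuracy, accuracy_on_consistent_subset).
    Overall accuracy scores each question by majority vote over its
    trials (an assumption; per-trial scoring is another option)."""
    n_correct = n_cons = n_cons_correct = 0
    for answers, gold in records:
        majority = Counter(answers).most_common(1)[0][0]
        n_correct += majority == gold
        if len(set(answers)) == 1:  # consistent question
            n_cons += 1
            n_cons_correct += answers[0] == gold
    overall = n_correct / len(records)
    consistent = n_cons_correct / n_cons if n_cons else float("nan")
    return overall, consistent

def pearson(xs, ys):
    # Pearson correlation, e.g. across models: consistent-subset
    # accuracy (xs) vs. overall accuracy (ys).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

Computing the two accuracies per model and correlating them across models yields the kind of consistency–accuracy relationship the results describe; a deployment could then answer only questions in the consistent subset, trading coverage for certainty.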
📝 Abstract
This work explores the consistency of small LLMs (2B-8B parameters) when answering the same question multiple times. We present a study of well-known open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium-sized models (50B-80B), fine-tuned vs. base models, and other parameters. We also examine the effect of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both. To support these studies, we propose new analytical and graphical tools. Results show that the fraction of questions answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Accuracy among consistent answers also appears to correlate reasonably well with overall accuracy. Results for medium-sized models indicate much higher levels of answer consistency.