🤖 AI Summary
This study examines whether smaller open-weight large language models (LLMs) and reasoning models can assess the research quality of journal articles. It evaluates Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small, and DeepSeek R1 on 2,780 medical, health, and life science papers from six fields, using two gold standards and both zero-shot and few-shot prompts, with scores averaged over multiple identical queries. The results suggest that these models perform similarly to the proprietary baselines ChatGPT 4o-mini and Gemini 2.0 Flash, although 1b parameters may often, and 4b sometimes, be too few. Score averaging appeared to help universally; few-shot prompts (four examples) tended to help, but the evidence was equivocal; and reasoning models showed no clear advantage. The authors report this as the first evidence that smaller LLMs above 4b parameters, including reasoning models, have a substantial capability to score journal articles for research quality, which matters where larger models are slow or impractical.
📝 Abstract
Assessing published academic journal articles is a common task in evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and for the medium-sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one of them novel. The results suggest that smaller (open-weight) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help, but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
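
The score averaging described above can be illustrated with a minimal sketch: the same prompt is submitted several times and the numeric scores are averaged. Everything named here is an assumption for illustration, not the authors' code: `average_score`, `make_stub_scorer`, the default of five queries, and the stub reply values are all hypothetical, and the stub stands in for a real LLM call.

```python
from statistics import mean
from typing import Callable, Sequence


def average_score(
    score_once: Callable[[str], float],
    article_text: str,
    n_queries: int = 5,  # assumed default; the abstract does not give a count
) -> float:
    """Submit the same article text n_queries times and return the mean score."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    scores = [score_once(article_text) for _ in range(n_queries)]
    return mean(scores)


def make_stub_scorer(replies: Sequence[float]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM call: cycles through fixed replies."""
    state = {"i": 0}

    def score_once(_article_text: str) -> float:
        reply = replies[state["i"] % len(replies)]
        state["i"] += 1
        return reply

    return score_once


if __name__ == "__main__":
    # Made-up replies that vary between identical queries.
    scorer = make_stub_scorer([3.0, 2.0, 3.0, 4.0, 3.0])
    print(average_score(scorer, "Title and abstract of one article"))
```

In a real setting, `score_once` would wrap a model query and parse a numeric score from the reply; averaging then smooths out the variation between identical queries.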