Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

📅 2025-10-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether small and reasoning-oriented open-weight large language models (LLMs) can assess academic paper quality. It evaluates Gemma-3 variants, Llama-4-Scout, Qwen-3, Magistral-Small, and DeepSeek-R1 on 2,780 medical, health, and life science papers across six fields, using zero-shot and few-shot scoring protocols augmented by multi-query score averaging. Results show that open-weight models above 4B parameters achieve performance comparable to proprietary baselines (ChatGPT-4o-mini and Gemini-2.0-Flash), whereas 1B models often, and 4B models sometimes, fall short; score averaging markedly improves scoring stability and robustness; few-shot prompting yields only marginal, equivocal gains; and reasoning-optimized models exhibit no systematic advantage. To the authors' knowledge, this is the first empirical validation that small open-weight LLMs, including reasoning models, can score journal articles for research quality, demonstrating their practical utility in resource-constrained settings and offering a pathway toward automated, scalable academic assessment.

📝 Abstract
Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium-sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help, but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs (>4b), including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
Problem

Research questions and friction points this paper is trying to address.

Evaluating smaller LLMs' ability to score journal article research quality
Assessing reasoning models' performance compared to standard LLMs for evaluation
Testing score averaging and few-shot prompts for improving assessment accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small open-weight LLMs (>4b) score articles comparably to large proprietary models
Averaging scores from multiple identical queries consistently improves scoring accuracy
Few-shot prompts (four examples) provide only equivocal benefits for quality assessment
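The multi-query score averaging strategy highlighted above can be sketched in a few lines: send the same scoring prompt to the model several times and average the returned numeric scores to dampen run-to-run variability. The `query_fn` callable below is a hypothetical stand-in for a real LLM API call, not the paper's actual implementation.

```python
import statistics

def score_article(query_fn, prompt: str, n_queries: int = 5) -> float:
    """Send the same scoring prompt n_queries times and average the
    numeric quality scores, smoothing out per-query noise."""
    scores = [query_fn(prompt) for _ in range(n_queries)]
    return statistics.mean(scores)

# Mock LLM call: cycles through slightly varying scores, mimicking the
# run-to-run variability that averaging is meant to dampen.
_replies = iter([3.0, 4.0, 3.0, 4.0, 3.0])
mock_llm = lambda prompt: next(_replies)

print(score_article(mock_llm, "Score this abstract 1-5: ...", n_queries=5))  # 3.4
```

In practice `query_fn` would wrap a chat-completion call and parse a score from the model's reply; the averaging logic itself is unchanged.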