Prompt perturbation and fraction facilitation sometimes strengthen Large Language Model scores

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of improving the accuracy and reliability of large language models (LLMs) in scoring the quality of scientific texts (e.g., journal articles). To mitigate limitations imposed by prompt design—particularly its impact on scoring consistency and confidence calibration—the authors propose four strategies: semantically equivalent prompt perturbation, fractional scoring for fine-grained resolution, partial input masking, and multi-prompt score averaging. They conduct large-scale empirical validation across multiple LLMs, involving 1.7 million invocations of Gemma3 27b. Results show that mixing semantically equivalent prompts significantly reduces null-response rates and enhances scoring stability, and that multi-prompt averaging is the most robust optimization, though the optimal strategy is model-specific. This work decouples prompt perturbation from internal confidence estimation, establishing a reproducible and transferable prompt engineering approach for fine-grained LLM evaluation tasks.

📝 Abstract
Large Language Models (LLMs) can be tasked with scoring texts according to pre-defined criteria and on a defined scale, but there is no recognised optimal prompting strategy for this. This article focuses on the task of LLMs scoring journal articles for research quality on a four-point scale, testing how user prompt design can enhance this ability. Based primarily on 1.7 million Gemma3 27b queries for 2780 health and life science articles with 58 similar prompts, the results show that improvements can be obtained by (a) testing semantically equivalent prompt variations, (b) averaging scores from semantically equivalent prompts, (c) specifying that fractional scores are allowed, and possibly also (d) not drawing attention to the input being partial. Whilst (a) and (d) suggest that models can be sensitive to how a task is phrased, (b) and (c) suggest that strategies to leverage more of the model's knowledge are helpful, such as by perturbing prompts and facilitating fractions. Perhaps counterintuitively, encouraging incorrect answers (fractions for this task) releases useful information about the model's certainty about its answers. Mixing semantically equivalent prompts also reduces the chance of getting no score for an input. Additional testing showed that the best prompts vary between LLMs, however, being almost the opposite for ChatGPT 4o-mini, weakly aligned for Llama4 Scout and Magistral, and making little difference to Qwen3 32b and DeepSeek R1 32b. Overall, whilst there is no single best prompt, a good strategy for all models was to average the scores from a range of different semantically equivalent or similar prompts.
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompt design for LLM scoring tasks
Enhancing scoring accuracy through prompt variations and averaging
Exploring model-specific prompt effectiveness across different LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing semantically equivalent prompt variations
Averaging scores from semantically equivalent prompts
Specifying fractional scores to release model certainty