🤖 AI Summary
This paper identifies a pervasive “evaluation bias” in LLM benchmarking: current prompts frequently embed explicit scoring cues—such as enforced chain-of-thought reasoning or rigid formatting constraints—distorting model behavior and inflating performance metrics without genuine capability improvement. To systematically investigate this, the authors design a reproducible A/B testing framework and conduct six controlled experiments on the GPT-OSS-20B model. Using deterministic validators, structured parsing, and multidimensional evaluation metrics, they analyze the impact of task framing and reasoning depth. Results show that evaluation-aware prompting induces redundant reasoning chains and reduced compliance; subtle changes in incentive wording shift error type distributions; and non-English prompts significantly degrade performance on high-reasoning tasks. Crucially, the study provides the first empirical evidence that evaluation-oriented prompting fails to enhance accuracy stability and instead introduces systematic biases.
📝 Abstract
Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such "evaluation scent" inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.