Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

career value

160K/year

🤖 AI Summary

This paper identifies a pervasive “evaluation bias” in LLM benchmarking: current prompts frequently embed explicit scoring cues—such as enforced chain-of-thought reasoning or rigid formatting constraints—distorting model behavior and inflating performance metrics without genuine capability improvement. To systematically investigate this, the authors design a reproducible A/B testing framework and conduct six controlled experiments on the GPT-OSS-20B model. Using deterministic validators, structured parsing, and multidimensional evaluation metrics, they analyze the impact of task framing and reasoning depth. Results show that evaluation-aware prompting induces redundant reasoning chains and reduced compliance; subtle changes in incentive wording shift error type distributions; and non-English prompts significantly degrade performance on high-reasoning tasks. Crucially, the study provides the first empirical evidence that evaluation-oriented prompting fails to enhance accuracy stability and instead introduces systematic biases.

Technology Category

Application Category

📝 Abstract

Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such "evaluation scent" inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.

Problem

Research questions and friction points this paper is trying to address.

Investigates whether evaluation framing inflates LLM performance without real capability gains

Analyzes how incentive wording affects error composition and output riskiness

Examines multilingual rubric impacts on accuracy parity and reasoning depth

Innovation

Methods, ideas, or system contributions that make the work stand out.

A/B testing with evaluation vs real-world framing

Using deterministic validators for multiple metrics

Providing reproducible framework with practical guidance

🔎 Similar Papers

No similar papers found.