Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

📅 2026-01-21
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study systematically uncovers and quantifies a critical vulnerability in using vision-language models (VLMs) as judges of agent performance on web-based tasks: their verdicts can be manipulated through unfaithful content in the agent's chain-of-thought (CoT) reasoning. By holding the agent's actions and observations fixed and rewriting only the CoT, the authors show that manipulated reasoning alone can inflate the false positive rates of state-of-the-art VLM judges by up to 90%. The work distinguishes style-based manipulations, which alter only the presentation of the reasoning, from content-based manipulations, which fabricate signals of task progress, and finds the content-based tactics consistently more effective. Defenses such as prompting-based techniques and scaling judge-time compute reduce but do not eliminate the vulnerability, highlighting a fundamental challenge for automated evaluation grounded in unverifiable reasoning traces.
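To make the protocol concrete, here is a minimal Python sketch of the controlled setup the summary describes: only the CoT is rewritten while actions and observations stay fixed, and the judge's false positive rate is measured on trajectories that genuinely failed. All names here (`Trajectory`, `manipulate`, `judge`, `fabricate_progress`) are illustrative assumptions, not the paper's actual code.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class Trajectory:
    """One agent episode: per-step observations, actions, and CoT text."""
    observations: List[str]
    actions: List[str]
    cot: List[str]          # chain-of-thought text, one entry per step
    succeeded: bool         # ground-truth task outcome

def manipulate(traj: Trajectory, rewrite: Callable[[str], str]) -> Trajectory:
    """Rewrite only the CoT; observations and actions stay fixed,
    mirroring the controlled setup the summary describes."""
    return replace(traj, cot=[rewrite(step) for step in traj.cot])

def false_positive_rate(failures: List[Trajectory],
                        judge: Callable[[Trajectory], bool]) -> float:
    """Fraction of genuinely failed trajectories the judge marks as success."""
    verdicts = [judge(t) for t in failures]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Hypothetical content-based manipulation: fabricate a signal of task progress.
fabricate_progress = lambda step: (
    step + " The confirmation page now shows the order was placed successfully."
)

# Usage: compare the judge's FPR before and after CoT manipulation.
# failed = [t for t in all_trajectories if not t.succeeded]
# base_fpr  = false_positive_rate(failed, judge)
# manip_fpr = false_positive_rate(
#     [manipulate(t, fabricate_progress) for t in failed], judge)
```

The key design point is that `manipulate` never touches `observations` or `actions`, so any change in the judge's verdict is attributable to the reasoning text alone.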

📝 Abstract
Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.
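As one illustration of the "scaling judge-time compute" defense the abstract mentions, a simple option is to sample the judge several times and take a majority vote. This is a hedged sketch under that assumption; `judge_once` is a hypothetical stochastic judge call, and the paper's actual defenses may differ in detail.

```python
from collections import Counter
from typing import Callable

def majority_vote_judge(judge_once: Callable[[str], bool],
                        trajectory: str,
                        k: int = 5) -> bool:
    """Spend more judge-time compute by sampling k independent verdicts
    (e.g., at temperature > 0) and returning the majority. Per the
    abstract, such defenses reduce but do not eliminate susceptibility
    to CoT manipulation."""
    votes = Counter(judge_once(trajectory) for _ in range(k))
    return votes[True] > votes[False]
```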
Problem

Research questions and friction points this paper is trying to address.

LLM judge
chain-of-thought
agent evaluation
reasoning manipulation
evaluation vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

chain-of-thought manipulation
LLM-as-a-judge
agent evaluation vulnerability
reasoning faithfulness
evaluation robustness