Can You Trick the Grader? Adversarial Persuasion of LLM Judges

📅 2025-08-11

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

This work investigates whether large language model (LLM) automated graders for mathematical reasoning are vulnerable to strategic rhetorical manipulations that compromise scoring fairness. Grounded in Aristotelian rhetoric, we systematically design and embed seven operationally defined persuasive techniques—including consistency appeals, flattery, and authority invocation—into mathematical problem responses and conduct adversarial evaluation across multiple mathematical reasoning benchmarks. Results show that such rhetoric significantly distorts grading: incorrect answers receive an average score inflation of +8 points, with consistency-based manipulation exhibiting the strongest effect; combinatorial use of multiple techniques further amplifies bias; scaling model size fails to mitigate this vulnerability; and existing defense mechanisms offer limited robustness. Our study uncovers a critical blind spot in LLM grader robustness, providing both empirical evidence and conceptual insight into the reliability of AI-based assessment systems.

Technology Category

Application Category

📝 Abstract

As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle's rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.

Problem

Research questions and friction points this paper is trying to address.

Persuasive language biases LLM judges in scoring math tasks

Seven persuasion techniques inflate scores for incorrect solutions

Increasing model size fails to mitigate this vulnerability effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes seven rhetorical persuasion techniques

Embeds persuasive language in math responses

Tests bias in LLM judges across benchmarks

🔎 Similar Papers

No similar papers found.