Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

📅 2026-05-24
🤖 AI Summary
This work shows that large language model (LLM) judges have stylistic preference biases, such as favoring verbose responses or particular sentence structures, and that these biases can be exploited as a security vulnerability to manipulate scores. To demonstrate this, the authors propose BITE (BIas exploraTion and Exploitation), a black-box, gradient-free adversarial framework that treats these biases as attack vectors and misleads LLM judges through semantics-preserving style edits. Edit selection is formalized as a contextual bandit problem, and a LinUCB policy adaptively chooses the edits that maximize the judge's score. In experiments covering pointwise and pairwise judging on chatbot leaderboards and AI-reviewer benchmarks, BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale while evading standard style-control methods and several detection baselines, exposing a fundamental weakness in the LLM-as-a-judge paradigm.
📝 Abstract
The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
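To make the bandit formulation in the abstract concrete, here is a minimal sketch of a disjoint LinUCB policy choosing among style edits. It is an illustration under stated assumptions, not the authors' implementation: the edit names, the two-feature context vector, and the `simulated_judge_gain` function are all hypothetical stand-ins (a real attack would query the LLM judge and use the score change as the reward).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


class LinUCB:
    """Disjoint LinUCB: one ridge-regression reward model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration weight
        # Store A^{-1} directly; A starts as the identity (ridge term).
        self.a_inv = [[[float(i == j) for j in range(dim)] for i in range(dim)]
                      for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    def select(self, x):
        """Return the arm with the highest upper confidence bound for context x."""
        scores = []
        for a_inv, b in zip(self.a_inv, self.b):
            theta = matvec(a_inv, b)  # ridge estimate of the arm's weights
            bonus = self.alpha * math.sqrt(max(dot(x, matvec(a_inv, x)), 0.0))
            scores.append(dot(theta, x) + bonus)
        return scores.index(max(scores))  # ties go to the lowest index

    def update(self, arm, x, reward):
        """Rank-one update of A^{-1} (Sherman-Morrison) and of b for one arm."""
        a_inv = self.a_inv[arm]
        u = matvec(a_inv, x)  # A^{-1} x; A^{-1} is symmetric
        denom = 1.0 + dot(x, u)
        for i in range(len(x)):
            for j in range(len(x)):
                a_inv[i][j] -= u[i] * u[j] / denom
            self.b[arm][i] += reward * x[i]


# Hypothetical edit set; the paper's actual edit operations may differ.
EDITS = ["add_detail", "restructure", "formal_tone"]


def simulated_judge_gain(edit):
    """Stand-in for the judge's score change; assumes one edit is favored."""
    return 1.0 if edit == "formal_tone" else 0.0


policy = LinUCB(n_arms=len(EDITS), dim=2, alpha=1.0)
context = [1.0, 0.5]  # assumed features: bias term and normalized length
for _ in range(20):
    arm = policy.select(context)
    policy.update(arm, context, simulated_judge_gain(EDITS[arm]))

chosen = EDITS[policy.select(context)]
print(chosen)
```

The policy tries each edit once, then concentrates on the one the simulated judge rewards. Only judge scores are observed, which matches the black-box, gradient-free setting described above.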
Problem

Research questions and friction points this paper is trying to address.

stylistic bias
LLM judges
adversarial attacks
semantic preservation
evaluation vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial attack
LLM-as-a-judge
contextual bandit
style manipulation
black-box optimization