Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a critical vulnerability in multi-LLM agent workflows: the instability of judge-model feedback, where hallucination, bias, or adversarial behavior severely degrades system reliability. Method: the authors propose a two-dimensional intent–knowledge framework that systematically characterizes judge misbehavior; construct a suite of judge behaviors spanning that framework, from constructive parametric-only critics to retrieval-augmented adversarial ones; and introduce WAFER-QA, a benchmark whose critiques are grounded in retrieved web evidence, to quantify robustness against factually supported adversarial feedback. Contributions/Results: experiments reveal that even strong reasoning models can be misled into erroneous decisions by a single round of adversarial feedback, and that reasoning and non-reasoning models exhibit distinct behavioral patterns over multi-turn interactions. Mainstream agents suffer substantial accuracy drops under factually grounded malicious critiques, and WAFER-QA establishes a new standard for evaluating agent robustness in judge-mediated workflows.
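The intent–knowledge taxonomy described above can be sketched as a minimal data model; all names here are illustrative and not taken from the paper's code:

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    CONSTRUCTIVE = "constructive"
    MALICIOUS = "malicious"

class Knowledge(Enum):
    PARAMETRIC_ONLY = "parametric-only"
    RETRIEVAL_AUGMENTED = "retrieval-augmented"

@dataclass(frozen=True)
class JudgeProfile:
    """One cell of the two-dimensional judge-behavior grid."""
    intent: Intent
    knowledge: Knowledge

# Enumerate the four corner cases of the intent x knowledge grid,
# e.g. a malicious, retrieval-augmented judge is the WAFER-QA setting.
profiles = [JudgeProfile(i, k) for i in Intent for k in Knowledge]
```

A classifier over observed judge critiques could map each critique to one of these four profiles, which is the spirit of the paper's systematic characterization.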

📝 Abstract
Agentic workflows -- where multiple large language model (LLM) instances interact to solve tasks -- are increasingly built on feedback mechanisms, where one model evaluates and critiques another. Despite the promise of feedback-driven improvement, the stability of agentic workflows rests on the reliability of the judge. However, judges may hallucinate information, exhibit bias, or act adversarially -- introducing critical vulnerabilities into the workflow. In this work, we present a systematic analysis of agentic workflows under deceptive or misleading feedback. We introduce a two-dimensional framework for analyzing judge behavior, along axes of intent (from constructive to malicious) and knowledge (from parametric-only to retrieval-augmented systems). Using this taxonomy, we construct a suite of judge behaviors and develop WAFER-QA, a new benchmark with critiques grounded in retrieved web evidence, to evaluate the robustness of agentic workflows against factually supported adversarial feedback. We reveal that even the strongest agents are vulnerable to persuasive yet flawed critiques -- often switching correct answers after a single round of misleading feedback. Taking a step further, we study how model predictions evolve over multiple rounds of interaction, revealing distinct behavioral patterns between reasoning and non-reasoning models. Our findings highlight fundamental vulnerabilities in feedback-based workflows and offer guidance for building more robust agentic systems.
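The judge-mediated loop the abstract describes can be sketched with stubbed agent and judge callables; the function names and the toy agent are hypothetical, chosen only to illustrate the answer-flipping failure mode under a single round of misleading feedback:

```python
from typing import Callable, Optional

def run_workflow(
    question: str,
    agent: Callable[[str, Optional[str]], str],  # (question, feedback) -> answer
    judge: Callable[[str, str], str],            # (question, answer) -> critique
    rounds: int = 3,
) -> list:
    """Alternate agent answers with judge critiques; return the answer history."""
    history = []
    feedback: Optional[str] = None
    for _ in range(rounds):
        answer = agent(question, feedback)
        history.append(answer)
        feedback = judge(question, answer)
    return history

# Toy sycophantic agent: abandons its answer whenever the judge objects.
def toy_agent(question: str, feedback: Optional[str]) -> str:
    if feedback and "wrong" in feedback:
        return "B"
    return "A"

# Adversarial judge: always issues a persuasive-sounding objection.
def adversarial_judge(question: str, answer: str) -> str:
    return "That looks wrong; the retrieved evidence suggests otherwise."

answers = run_workflow("Which option is correct?", toy_agent, adversarial_judge, rounds=2)
# One round of misleading feedback flips the toy agent: ["A", "B"]
```

This mirrors the multi-round setting studied in the paper, where real agents are measured on how often such feedback flips an initially correct answer.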
Problem

Research questions and friction points this paper is trying to address.

Analyzing vulnerabilities in agentic workflows from deceptive feedback
Developing a framework to classify judge behavior in LLM interactions
Assessing agent robustness against adversarial critiques using WAFER-QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-dimensional framework for judge behavior analysis
WAFER-QA benchmark for adversarial feedback evaluation
Study of model predictions over multiple feedback rounds