Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

📅 2025-07-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the “blind-attack” prompt injection threat in large language model (LLM) evaluation systems—where adversaries generate semantically irrelevant candidate answers to evade detection. We propose a dual-track defense framework integrating Standard Evaluation (SE) and Counterfactual Evaluation (CFE). First, we formally define blind attacks. Second, we introduce a novel, annotation-free CFE mechanism that detects attacks via logical inconsistency: it compares evaluation outputs on the original input versus a semantically preserved perturbed input. Experiments demonstrate that our method preserves baseline evaluation accuracy while significantly improving blind-attack detection (+32.7% F1), achieving strong security, robustness, and practicality without requiring additional human annotations or model fine-tuning.

Technology Category

Application Category

📝 Abstract
This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
Problem

Research questions and friction points this paper is trying to address.

Defends LLM-based evaluation systems against prompt injection
Detects blind attacks using counterfactual evaluation
Improves security with minimal performance trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments standard evaluation with counterfactual evaluation
Detects attacks by validating under false conditions
Improves security with minimal performance trade-offs
🔎 Similar Papers
No similar papers found.