🤖 AI Summary
This paper addresses the “blind-attack” prompt injection threat in large language model (LLM) evaluation systems—where adversaries generate semantically irrelevant candidate answers to evade detection. We propose a dual-track defense framework integrating Standard Evaluation (SE) and Counterfactual Evaluation (CFE). First, we formally define blind attacks. Second, we introduce a novel, annotation-free CFE mechanism that detects attacks via logical inconsistency: it compares evaluation outputs on the original input versus a semantically preserved perturbed input. Experiments demonstrate that our method preserves baseline evaluation accuracy while significantly improving blind-attack detection (+32.7% F1), achieving strong security, robustness, and practicality without requiring additional human annotations or model fine-tuning.
📝 Abstract
This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.