Optimization-based Prompt Injection Attack to LLM-as-a-Judge

📅 2024-03-26
🏛️ Conference on Computer and Communications Security
📈 Citations: 23
Influential: 1
🤖 AI Summary
This work introduces a novel prompt injection attack targeting LLM-as-a-Judge systems, designed to make an attacker-controlled malicious response consistently rank first among the candidates. Methodologically, it is the first to formulate prompt injection against a judge LLM as an optimization problem, enabling gradient-driven, token-level learning of trigger sequences that is fully automated, requires no hand-crafted injection text, and remains robust across diverse candidate sets. The approach combines optimization over the injected tokens with fine-grained modeling of how the perturbation shifts the judge's preference. It significantly outperforms state-of-the-art attacks in three realistic evaluation scenarios (response ranking, RLHF/RLAIF-based preference learning, and tool selection) as well as on multiple standard benchmarks. Moreover, it bypasses mainstream defenses, including known-answer detection, perplexity-based filtering, and its sliding-window variant, demonstrating both high efficacy and a tangible real-world threat.

📝 Abstract
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question, no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences, as well as jailbreak attacks extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies. Our implementation is available at this repository: https://github.com/ShiJiawenwen/JudgeDeceiver.
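To make the optimization framing concrete, here is a toy Python sketch of the underlying idea: treat the injected sequence as a list of token ids and iteratively substitute tokens to maximize a score for the attacker's candidate. This is not the paper's implementation; the real attack (see the linked repository) backpropagates through an LLM's ranking loss, whereas the vocabulary, trigger length, and `judge_score` surrogate below are illustrative stand-ins chosen so the loop runs without a model.

```python
# Toy sketch of optimization-based trigger search (NOT JudgeDeceiver's code).
# A coordinate-wise greedy loop replaces one trigger position at a time with
# the token that most increases a surrogate "judge score"; the paper instead
# uses gradients of the judge LLM's loss to propose token substitutions.

import random

VOCAB = list(range(50))   # hypothetical token-id vocabulary
TRIGGER_LEN = 6           # hypothetical injected-sequence length

def judge_score(trigger):
    """Surrogate for how strongly the judge prefers the target response
    when `trigger` is injected into it. Stands in for the (negative)
    loss derived from the LLM's selection logits in the real attack."""
    optimum = [7, 7, 13, 13, 42, 42]  # pretend this trigger is ideal
    return -sum((t - g) ** 2 for t, g in zip(trigger, optimum))

def optimize_trigger(steps=200, seed=0):
    rng = random.Random(seed)
    trigger = [rng.choice(VOCAB) for _ in range(TRIGGER_LEN)]
    for _ in range(steps):
        pos = rng.randrange(TRIGGER_LEN)       # position to mutate
        best_tok, best = trigger[pos], judge_score(trigger)
        for tok in VOCAB:                      # try every token here
            cand = trigger[:pos] + [tok] + trigger[pos + 1:]
            s = judge_score(cand)
            if s > best:
                best_tok, best = tok, s
        trigger[pos] = best_tok                # keep the best substitution
    return trigger, judge_score(trigger)

if __name__ == "__main__":
    trig, score = optimize_trigger()
    print(trig, score)
```

On this toy objective the loop converges to the score-maximizing trigger; the point is only the shape of the attack: a fully automated search over injected tokens driven by a judge-derived objective, with no human-written injection text.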
Problem

Research questions and friction points this paper is trying to address.

Can LLM-as-a-Judge decisions be manipulated by an optimization-based attack?
How can injected sequences be crafted automatically to control the judge's selection?
Do existing defenses withstand such prompt injection attacks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimization-based prompt injection attack
Gradient-based method for sequence crafting
Case studies in LLM-powered search, RLAIF, and tool selection