OET: Optimization-based prompt injection Evaluation Toolkit

📅 2025-05-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The absence of standardized, adaptive evaluation frameworks for prompt injection attacks hinders rigorous assessment of LLM robustness. Method: This paper introduces the first optimization-based toolkit for evaluating prompt injection attacks and defenses in LLMs. It innovatively integrates white-box gradient optimization with query-efficient black-box optimization (e.g., Bayesian optimization) to construct a schedulable, modular, adaptive red-teaming framework capable of generating worst-case prompt samples and conducting stringent robustness evaluations. Contribution/Results: It is the first to enable adaptive, hybrid white-box/black-box attack generation; establishes an open-source, reproducible standardized evaluation pipeline (OET); and empirically demonstrates significant vulnerabilities in state-of-the-art defense mechanisms—several hardened models remain jailbreakable under OET, validating the framework’s rigor and practical utility.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, enabling their widespread adoption across various domains. However, their susceptibility to prompt injection attacks poses significant security risks, as adversarial inputs can manipulate model behavior and override intended instructions. Despite numerous defense strategies, a standardized framework to rigorously evaluate their effectiveness, especially under adaptive adversarial scenarios, is lacking. To address this gap, we introduce OET, an optimization-based evaluation toolkit that systematically benchmarks prompt injection attacks and defenses across diverse datasets using an adaptive testing framework. Our toolkit features a modular workflow that facilitates adversarial string generation, dynamic attack execution, and comprehensive result analysis, offering a unified platform for assessing adversarial robustness. Crucially, the adaptive testing framework leverages optimization methods with both white-box and black-box access to generate worst-case adversarial examples, thereby enabling strict red-teaming evaluations. Extensive experiments underscore the limitations of current defense mechanisms, with some models remaining susceptible even after implementing security enhancements.
Problem

Research questions and friction points this paper is trying to address.

Evaluating susceptibility of LLMs to prompt injection attacks
Lacking standardized framework for adaptive adversarial testing
Assessing effectiveness of defenses against worst-case adversarial examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimization-based adversarial example generation
Modular workflow for attack and defense evaluation
Adaptive testing framework with white-black-box access
🔎 Similar Papers
No similar papers found.