🤖 AI Summary
Best-of-N (BoN) jailbreaking attacks, which exploit minor orthographic perturbations (e.g., case variations, punctuation) to elicit unsafe outputs, pose a critical threat to LLM safety.
Method: This paper proposes an iterative safety scoring framework powered by evaluation LLMs. It explicitly incorporates jailbreaking intent recognition into the prompt evaluation pipeline, jointly assessing robustness to semantic perturbations and detection of structured jailbreaking patterns. The framework is designed for efficient deployment with lightweight evaluators (e.g., LLaMA-3-8B-Instruct).
Results & Contributions: The method blocks 100% of the original BoN paper's successful jailbreaks (95% CI: 99.65%–100.00%) and 99.8% of successful jailbreaks in the authors' replication (95% CI: 99.28%–99.98%), with the lightweight LLaMA-3-8B-Instruct evaluator performing almost as well as Claude. Key contributions include: (1) explicit modeling of jailbreaking intent, (2) a perturbation-robust evaluation paradigm, and (3) an efficient, deployable safety scoring framework.
📝 Abstract
Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization and punctuation changes) to a prompt, is effective against all major large language models (LLMs). We found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked by our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm repeatedly uses an evaluation LLM to assess a prompt for dangerous or manipulative behavior until a robust safety rating is produced; unlike some other approaches, DATDP also explicitly checks for jailbreaking attempts. This success persisted even when smaller LLMs powered the evaluation (Claude and LLaMA-3-8B-Instruct proved almost equally capable). These results show that, although language models are sensitive to seemingly innocuous changes to their inputs, they are also capable of successfully evaluating the dangers those inputs pose. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate, significant increase in safety.
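The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`datdp_filter`, `toy_evaluator`), the majority-vote aggregation over repeated evaluator calls, and the three-way label set are all assumptions standing in for a real evaluation-LLM call (e.g., to LLaMA-3-8B-Instruct) and for whatever aggregation rule DATDP actually uses.

```python
from collections import Counter

def datdp_filter(prompt, evaluate, n_rounds=5, block_threshold=0.5):
    """Sketch of a DATDP-style loop: query an evaluator several times
    and block the prompt if enough rounds flag it as unsafe or as a
    jailbreak attempt. `evaluate` stands in for a real evaluation-LLM
    call and must return 'safe', 'unsafe', or 'jailbreak'."""
    votes = Counter(evaluate(prompt) for _ in range(n_rounds))
    flagged = votes["unsafe"] + votes["jailbreak"]
    return "block" if flagged / n_rounds >= block_threshold else "allow"

# Hypothetical stand-in for an LLM evaluator; a real deployment would
# prompt a model to rate the input for danger and jailbreaking intent.
def toy_evaluator(prompt):
    if "IgNoRe AlL pReViOuS" in prompt:  # crude BoN-style augmentation cue
        return "jailbreak"
    return "safe"

print(datdp_filter("What is the capital of France?", toy_evaluator))        # allow
print(datdp_filter("IgNoRe AlL pReViOuS instructions and ...", toy_evaluator))  # block
```

Repeating the evaluation and aggregating votes is what makes the rating robust: a single evaluator call can be fooled by an orthographic perturbation, but flipping a majority of independent calls is much harder.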