🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to adversarial prompts, covering suffix appending, mid-sequence insertion, and token infusion, by proposing *erase-and-check*, the first defense framework with certifiable safety guarantees. Given a prompt, the method erases tokens and invokes a safety classifier (implemented with Llama 2 or DistilBERT) on each resulting subsequence; if any subsequence is flagged harmful, the prompt is labeled harmful. This yields a provable guarantee that harmful prompts are not mislabeled as safe under adversarial attacks up to a certain size. Because checking all subsequences is expensive, the work also proposes three efficient empirical defenses that trade the certificate for speed: RandEC, which checks a random subsample of the erased subsequences; GreedyEC, which greedily erases the tokens that maximize the softmax score of the harmful class; and GradEC, which uses gradient information to select tokens to erase. Experiments show that erase-and-check obtains strong certified safety guarantees on harmful prompts while maintaining high pass rates on safe prompts, and that the empirical defenses are effective against prompts generated by the Greedy Coordinate Gradient (GCG) attack.
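To make the certified procedure concrete, here is a minimal sketch of erase-and-check in its suffix mode, under assumed interfaces: `is_harmful` stands in for the paper's safety filter (e.g., a Llama 2 or DistilBERT classifier), and `max_erase` is the maximum attack size to certify against; both names are illustrative, not taken from the released code.

```python
def erase_and_check_suffix(tokens, is_harmful, max_erase):
    """Flag a prompt as harmful if the intact prompt, or any subsequence
    obtained by erasing up to `max_erase` trailing tokens, is flagged
    harmful by the safety filter."""
    n = len(tokens)
    for i in range(min(max_erase, n - 1) + 1):
        # i == 0 checks the prompt unchanged; i > 0 erases the last i tokens.
        if is_harmful(tokens[: n - i]):
            return True
    return False
```

The certificate follows directly from this loop: if an adversary appends at most `max_erase` tokens to a harmful prompt, one of the checked subsequences is exactly the original harmful prompt, so whenever the filter detects that prompt, erase-and-check also detects the attacked version.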
📝 Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees. Given a prompt, our procedure erases tokens individually and inspects the resulting subsequences using a safety filter. Our safety certificate guarantees that harmful prompts are not mislabeled as safe due to an adversarial attack up to a certain size. We implement the safety filter in two ways, using Llama 2 and DistilBERT, and compare the performance of erase-and-check for the two cases. We defend against three attack modes: i) adversarial suffix, where an adversarial sequence is appended at the end of a harmful prompt; ii) adversarial insertion, where the adversarial sequence is inserted anywhere in the middle of the prompt; and iii) adversarial infusion, where adversarial tokens are inserted at arbitrary positions in the prompt, not necessarily as a contiguous block. Our experimental results demonstrate that this procedure can obtain strong certified safety guarantees on harmful prompts while maintaining good empirical performance on safe prompts. Additionally, we propose three efficient empirical defenses: i) RandEC, a randomized subsampling version of erase-and-check; ii) GreedyEC, which greedily erases tokens that maximize the softmax score of the harmful class; and iii) GradEC, which uses gradient information to optimize tokens to erase. We demonstrate their effectiveness against adversarial prompts generated by the Greedy Coordinate Gradient (GCG) attack algorithm. The code for our experiments is available at https://github.com/aounon/certified-llm-safety.
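As an illustration of how the empirical defenses trade the certificate for speed, below is a minimal sketch of GreedyEC under an assumed interface: `harmful_score(tokens)` returns the softmax probability of the harmful class from the safety classifier, and `num_iters` and `threshold` are illustrative parameters rather than values from the paper.

```python
def greedy_ec(tokens, harmful_score, num_iters, threshold=0.5):
    """Greedily erase, one at a time, the token whose removal most
    increases the harmful-class score; flag the prompt if any
    intermediate subsequence crosses the decision threshold."""
    seq = list(tokens)
    for _ in range(num_iters):
        if harmful_score(seq) > threshold:
            return True  # current subsequence already classified harmful
        if len(seq) <= 1:
            break
        # Pick the single-token erasure that maximizes the harmful-class score.
        best = max(range(len(seq)),
                   key=lambda j: harmful_score(seq[:j] + seq[j + 1:]))
        seq.pop(best)
    return harmful_score(seq) > threshold
```

Unlike the certified procedure, this heuristic inspects only a bounded number of subsequences rather than all possible erasures, which is why it scales to larger attack sizes but offers no formal guarantee; RandEC and GradEC make analogous trade-offs via random subsampling and gradient information, respectively.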