AI Summary
Large language models are vulnerable to jailbreak prompts, injected backdoors, and the retention of harmful knowledge. This work proposes Latent Instruction Representation Alignment (LIRA), which trains the model to change its internal representation of an instruction rather than only its output behavior, and adds adversarial training in latent space to improve generalization. On defense and unlearning tasks, LIRA blocks over 99% of PEZ jailbreak attacks, removes a challenging insecure-code backdoor, and achieves optimal forgetting on the WMDP cyber subset, with negligible loss of benign capabilities.
Abstract
We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
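The abstract does not spell out the training objective, so the following is only a toy sketch of the general idea of aligning a latent representation under an adversarial latent perturbation, not the LIRA algorithm itself. Everything in it is an assumption made for illustration: a 2x2 linear map stands in for the model's hidden layer, `target` is a hypothetical representation that a malign instruction should be pulled toward, and the squared-distance loss, the retain term on a benign instruction, and the names `train` and `worst_case_delta` are all invented here.

```python
# Toy illustration only. A linear "encoder" W stands in for an LLM hidden layer.
# The loss form and every name below are assumptions, not taken from the paper.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def worst_case_delta(h, target, eps):
    """Latent perturbation of norm eps that maximizes ||h + delta - target||^2."""
    diff = [hi - ti for hi, ti in zip(h, target)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        return [eps] + [0.0] * (len(h) - 1)
    return [eps * d / norm for d in diff]

def train(W, x_malign, target, x_benign, eps=0.1, lr=0.05, steps=500):
    """Pull the malign instruction's latent toward target while keeping the
    benign instruction's latent at its original value."""
    W = [row[:] for row in W]              # work on a copy of the encoder
    benign_ref = matvec(W, x_benign)       # frozen reference for the retain term
    for _ in range(steps):
        h = matvec(W, x_malign)
        delta = worst_case_delta(h, target, eps)   # inner adversarial step
        # Residuals of the two squared-error terms (delta is held fixed).
        r_m = [hi + di - ti for hi, di, ti in zip(h, delta, target)]
        r_b = [hi - bi for hi, bi in zip(matvec(W, x_benign), benign_ref)]
        for i in range(len(W)):
            for j in range(len(W[i])):
                grad = 2 * r_m[i] * x_malign[j] + 2 * r_b[i] * x_benign[j]
                W[i][j] -= lr * grad
    return W

W0 = [[1.0, 0.0], [0.0, 1.0]]
x_malign, x_benign = [1.0, 0.0], [0.0, 1.0]
target = [-1.0, 0.5]                       # hypothetical target representation
W = train(W0, x_malign, target, x_benign)
print("malign latent distance to target:", sq_dist(matvec(W, x_malign), target))
print("benign latent drift:", sq_dist(matvec(W, x_benign), matvec(W0, x_benign)))
```

In this toy setup the malign instruction's latent ends up close to the target while the benign instruction's latent stays where it started, which mirrors the abstract's claim of strong generalization with negligible loss of benign capabilities only in spirit. A real implementation would operate on transformer hidden states and use the objective defined in the paper.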