Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

📅 2026-04-11
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Large language models are vulnerable to prompt jailbreaks, backdoor injection, and retention of harmful knowledge. This work proposes Latent Instruction Representation Alignment (LIRA), which trains the model to change how it internally represents instructions rather than only optimizing its output behavior on malign instructions, and adds an internally adversarial training algorithm to further improve generalization. The authors report that LIRA blocks over 99% of PEZ jailbreak attacks, removes a challenging insecure code backdoor, and achieves optimal forgetting on the cyber subset of the WMDP benchmark, with negligible loss of benign capabilities.

📝 Abstract
We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
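The abstract contrasts training on a model's outputs with training on how it represents instructions, but it does not state the loss or the training procedure. As a purely illustrative sketch of the general idea of latent alignment (not the paper's method), the toy below pulls the latent of a "malign" instruction toward a chosen target latent while holding a "benign" latent fixed. Everything here is an assumption made for illustration: the 2x2 linear map `W`, the feature vectors, the target latent, and the names `encode` and `align_step`.

```python
# Toy sketch only; this is NOT the paper's algorithm.
# A 2x2 linear map W stands in (hypothetically) for the layers that turn an
# instruction into a latent representation.

def encode(W, x):
    """Latent representation of instruction features x under the map W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def align_step(W, x_malign, target, x_benign, benign_ref, lr=0.1):
    """One gradient step on
    ||W x_malign - target||^2 + ||W x_benign - benign_ref||^2.
    The first term moves the malign latent; the second keeps the benign one."""
    err_m = [z - t for z, t in zip(encode(W, x_malign), target)]
    err_b = [z - t for z, t in zip(encode(W, x_benign), benign_ref)]
    return [
        [w - lr * 2 * (em * xm + eb * xb)
         for w, xm, xb in zip(row, x_malign, x_benign)]
        for row, em, eb in zip(W, err_m, err_b)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]          # initial "model"
x_malign = [1.0, 0.0]                 # hypothetical malign instruction features
x_benign = [0.0, 1.0]                 # hypothetical benign instruction features
target = [0.0, 0.5]                   # hypothetical latent we want malign inputs to map to
benign_ref = encode(W, x_benign)      # benign latent before training, to be preserved

before = sq_dist(encode(W, x_malign), target)
for _ in range(100):
    W = align_step(W, x_malign, target, x_benign, benign_ref)
after = sq_dist(encode(W, x_malign), target)
benign_drift = sq_dist(encode(W, x_benign), benign_ref)

print(before, after, benign_drift)
```

The loss is defined on latents rather than on generated text, which is the distinction the abstract draws. A real model would replace `W` with selected hidden layers and would need a principled choice of target representation, neither of which is specified in this summary.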

Problem

Research questions and friction points this paper is trying to address.

jailbreaks
backdoors
unlearning
large language models
undesired knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Instruction Representation Alignment
jailbreak defense
backdoor removal
adversarial training
machine unlearning