Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models are vulnerable to prompt jailbreaks, backdoor injection, and retention of harmful knowledge. This work proposes Latent Instruction Representation Alignment (LIRA), which trains the model to change how it internally represents instructions rather than training only on its output behavior, and adds an internally adversarial training algorithm to further improve generalization. According to the abstract, LIRA blocks over 99% of PEZ jailbreak attacks, removes a challenging insecure code backdoor, and achieves optimal forgetting on the cyber subset of the WMDP benchmark, all with negligible loss of benign capabilities.

📝 Abstract

We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
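
The abstract contrasts training on a model's actions with training on how it represents instructions, but it does not state the objective used. As a rough illustration of what a representation-level objective can look like, the toy sketch below computes a squared distance between the pooled hidden states of a malign instruction and those of a reference instruction. This is not the paper's LIRA algorithm: the function names, the mean pooling, the choice of reference, and the numbers are all assumptions made for illustration.

```python
# Toy sketch only: NOT the paper's LIRA algorithm.
# All names, the pooling choice, and the values below are hypothetical.
from typing import List

Vector = List[float]


def mean_pool(hidden_states: List[Vector]) -> Vector:
    """Average per-token hidden vectors into one instruction representation."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[i] for tok in hidden_states) / n_tokens for i in range(dim)]


def alignment_loss(malign_hidden: List[Vector], reference_hidden: List[Vector]) -> float:
    """Squared L2 distance between two pooled instruction representations.

    Minimizing this would pull the internal representation of the malign
    instruction toward the reference one, without looking at model outputs.
    """
    a = mean_pool(malign_hidden)
    b = mean_pool(reference_hidden)
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical per-token hidden states (2 tokens, 2 dimensions each).
malign = [[1.0, 0.0], [3.0, 2.0]]      # pooled: [2.0, 1.0]
reference = [[0.0, 1.0], [2.0, 1.0]]   # pooled: [1.0, 1.0]

loss = alignment_loss(malign, reference)  # (2 - 1)^2 + (1 - 1)^2 = 1.0
```

In a real setting the hidden states would come from an intermediate layer of the model, and the loss would be backpropagated into its weights. Which layer, which reference, and how this combines with the adversarial component are details the abstract leaves open.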

Problem

Research questions and friction points this paper is trying to address.

jailbreaks

backdoors

unlearning

large language models

undesired knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Instruction Representation Alignment

jailbreak defense

backdoor removal