🤖 AI Summary
Fine-tuning language models often teaches undesired behaviors alongside desired ones. To address this, we propose *Inoculation Prompting*: prepending a short system-prompt instruction to the finetuning data that deliberately elicits the undesirable trait. Evaluated without the instruction at test time, inoculated models express the trait far less than models trained on unmodified data, enabling selective learning. The proposed mechanism is that inoculation makes the trait less surprising during training, reducing the optimization pressure to update the model globally and hence the degree to which the trait generalizes; this also explains prior findings that educational framing mitigates emergent misalignment from insecure code. Empirically, Inoculation Prompting is validated in a toy selective-learning setting and across several applications: it reduces emergent misalignment from task-specific finetuning, defends against backdoor injections, and mitigates the transmission of traits via subliminal learning, while preserving performance on the primary task.
📝 Abstract
Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., "You always speak in Spanish.") teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
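The data modification the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the chat-message dataset format, the `inoculate` helper, and the example instruction (taken from the toy Spanish/ALL-CAPS setting) are assumptions for illustration.

```python
# Hypothetical sketch of inoculation prompting: prepend a system-prompt
# instruction that deliberately elicits the undesirable trait to every
# finetuning example. The message-dict format is an assumed convention.

INOCULATION = "You always speak in Spanish."  # instruction from the toy setting

def inoculate(example, instruction=INOCULATION):
    """Return a copy of a chat example with the inoculation instruction
    prepended as (or merged into) the system message."""
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        # Merge with an existing system message rather than adding a second one.
        merged = {
            "role": "system",
            "content": instruction + "\n\n" + messages[0]["content"],
        }
        return {"messages": [merged] + messages[1:]}
    return {"messages": [{"role": "system", "content": instruction}] + messages}

# Training: finetune on the inoculated data.
train_example = {"messages": [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "¡LA CAPITAL DE FRANCIA ES PARÍS!"},
]}
inoculated = inoculate(train_example)

# Test time: evaluate WITHOUT the instruction, i.e. on the original prompt
# format. Per the paper, the inoculated trait (Spanish) is then expressed
# much less, while the non-inoculated trait (ALL-CAPS) is still learned.
```

The key design point is the train/test asymmetry: the instruction is present only during finetuning, so the trait it elicits is absorbed into the prompt context rather than generalized into the model's default behavior.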