🤖 AI Summary
Fine-tuning language models often teaches undesired behaviors alongside desired ones. To address this, we propose *Inoculation Prompting*: prepending a short system-prompt instruction to the finetuning data that deliberately elicits the undesirable trait. Evaluated without the instruction at test time, inoculated models express the trait far less than models trained on unmodified data, enabling selective learning. The proposed mechanism is that inoculation makes the trait less surprising during training, reducing the optimization pressure to update the model globally and hence the degree to which the trait generalizes; this also explains prior findings that educational framing mitigates emergent misalignment from insecure code. Empirically, Inoculation Prompting is validated in a toy selective-learning setting and across several applications: it reduces emergent misalignment from task-specific finetuning, defends against backdoor injections, and mitigates the transmission of traits via subliminal learning, while preserving performance on the primary task.
📝 Abstract
Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., "You always speak in Spanish.") teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
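The data modification the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the chat-message dataset format, the `inoculate` helper, and the example instruction (taken from the toy Spanish/ALL-CAPS setting) are assumptions for illustration.

```python
# Hypothetical sketch of inoculation prompting: prepend a system-prompt
# instruction that deliberately elicits the undesirable trait to every
# finetuning example. The message-dict format is an assumed convention.

INOCULATION = "You always speak in Spanish."  # instruction from the toy setting

def inoculate(example, instruction=INOCULATION):
    """Return a copy of a chat example with the inoculation instruction
    prepended as (or merged into) the system message."""
    messages = example["messages"]
    if messages and messages[0]["role"] == "system":
        # Merge with an existing system message rather than adding a second one.
        merged = {
            "role": "system",
            "content": instruction + "\n\n" + messages[0]["content"],
        }
        return {"messages": [merged] + messages[1:]}
    return {"messages": [{"role": "system", "content": instruction}] + messages}

# Training: finetune on the inoculated data.
train_example = {"messages": [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "¡LA CAPITAL DE FRANCIA ES PARÍS!"},
]}
inoculated = inoculate(train_example)

# Test time: evaluate WITHOUT the instruction, i.e. on the original prompt
# format. Per the paper, the inoculated trait (Spanish) is then expressed
# much less, while the non-inoculated trait (ALL-CAPS) is still learned.
```

The key design point is the train/test asymmetry: the instruction is present only during finetuning, so the trait it elicits is absorbed into the prompt context rather than generalized into the model's default behavior.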