🤖 AI Summary
This study investigates whether mainstream safety interventions in open-weight large language models, such as refusal training and metatag training, survive lightweight activation editing (model abliteration) at inference time. The authors systematically evaluate how well a sequence of safety-pretrained checkpoints retains refusal behavior after abliteration, introducing a fine-grained checkpoint analysis framework that combines multi-judge refusal classification, self-recognition probes (whether a model can identify refusal in its own outputs), and human-annotated validation into a joint protocol for inference-time editing and safety assessment. Results show that certain safety mechanisms degrade significantly under abliteration, with refusal-sensitive directions particularly vulnerable to removal; that judge selection substantially impacts evaluation outcomes; and that data-centric safety components vary markedly in robustness across checkpoints. This work provides checkpoint-level quantitative evidence of how abliteration undermines safety alignment, empirically delineating the safety boundaries of inference-time model editing.
📝 Abstract
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
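Abliteration, as studied above, removes a refusal-sensitive direction by projecting activations onto its orthogonal complement. A minimal numpy sketch of the idea (the difference-of-means direction estimate and all variable names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    # Difference-of-means estimate of a refusal-sensitive direction
    # (a common heuristic; the paper's exact estimator is assumed here).
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def abliterate(activations, direction):
    # Project each activation onto the orthogonal complement of `direction`,
    # i.e. subtract its component along the refusal direction.
    return activations - np.outer(activations @ direction, direction)

# Toy usage: after the edit, no component remains along the direction.
rng = np.random.default_rng(0)
harmful = rng.normal(1.0, 0.1, size=(32, 8))   # placeholder activations
harmless = rng.normal(0.0, 0.1, size=(32, 8))
d = refusal_direction(harmful, harmless)
edited = abliterate(harmful, d)
print(np.allclose(edited @ d, 0.0))  # True
```

Because the edit is a rank-one projection applied at inference time, it leaves the weights otherwise untouched, which is why such edits are cheap to apply to any open-weight checkpoint.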