🤖 AI Summary
This study investigates whether mainstream safety interventions in open-weight large language models, such as refusal training and metatag training, survive lightweight activation editing (model abliteration) at inference time. The authors systematically evaluate how well a sequence of safety-pretrained checkpoints retains refusal behavior after abliteration, introducing a fine-grained checkpoint analysis framework that combines multi-judge refusal classification, self-recognition probes (whether a model can identify refusal in its own outputs), and human-annotated validation into a joint protocol for inference-time editing and safety assessment. Results show that certain safety mechanisms degrade significantly under abliteration, with refusal-sensitive directions particularly vulnerable to removal; that judge selection substantially impacts evaluation outcomes; and that data-centric safety components vary markedly in robustness across checkpoints. This work provides checkpoint-level quantitative evidence of how abliteration undermines safety alignment, empirically delineating the safety boundaries of inference-time model editing.
📝 Abstract
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
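Abliteration, as studied above, removes a refusal-sensitive direction by projecting activations onto its orthogonal complement. A minimal numpy sketch of the idea (the difference-of-means direction estimate and all variable names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    # Difference-of-means estimate of a refusal-sensitive direction
    # (a common heuristic; the paper's exact estimator is assumed here).
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def abliterate(activations, direction):
    # Project each activation onto the orthogonal complement of `direction`,
    # i.e. subtract its component along the refusal direction.
    return activations - np.outer(activations @ direction, direction)

# Toy usage: after the edit, no component remains along the direction.
rng = np.random.default_rng(0)
harmful = rng.normal(1.0, 0.1, size=(32, 8))   # placeholder activations
harmless = rng.normal(0.0, 0.1, size=(32, 8))
d = refusal_direction(harmful, harmless)
edited = abliterate(harmful, d)
print(np.allclose(edited @ d, 0.0))  # True
```

Because the edit is a rank-one projection applied at inference time, it leaves the weights otherwise untouched, which is why such edits are cheap to apply to any open-weight checkpoint.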