🤖 AI Summary
This work systematically investigates when safety interventions should be introduced during language-model pretraining, motivated by the vulnerability of pretrained models to adversarial attacks and to degradation under downstream fine-tuning. Holding the pretraining data and interventions fixed, the study varies only the intervention onset, introducing safety curricula after 0%, 20%, or 60% of the token budget, and uses linear probing of internal representations alongside evaluations of downstream fine-tuning robustness and safety-aware reasoning. The results show that the best onset depends on the goal: a delayed onset (20%-60%) yields the strongest robustness, with the clearest gains after benign downstream fine-tuning, while interventions from the very start improve steerability toward safer generations. Earlier interventions also strengthen the model's internal capacity to discriminate between safe and harmful content while maintaining a low over-rejection rate.
📝 Abstract
Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study a fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying only the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%-60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign fine-tuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability toward safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs. harmful examples. Our results are the first to establish intervention timing as a key curriculum design choice for safety.
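The linear-probing analysis mentioned above can be illustrated with a minimal sketch: a logistic-regression probe trained on frozen hidden states to separate safe from harmful examples. The code below is not the paper's implementation; it uses synthetic feature vectors as stand-ins for real model activations, and all dimensions and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical linear probe: logistic regression on frozen "activations".
# Synthetic stand-ins for hidden states; two classes offset along a
# random direction so the probe has a linear signal to find.
rng = np.random.default_rng(0)
d = 16    # assumed hidden-state dimension
n = 200   # examples per class

direction = rng.normal(size=d)
safe = rng.normal(size=(n, d)) - direction     # label 0
harmful = rng.normal(size=(n, d)) + direction  # label 1
X = np.vstack([safe, harmful])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Train the probe with plain gradient descent on the logistic loss;
# the base model's parameters would stay frozen in a real setup.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (pred == y).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

In the paper's setting, the probe's accuracy (or how cleanly a linear boundary separates the two classes) serves as a measure of how well the pretrained representations encode the safe/harmful distinction; the finding is that earlier interventions make this separation cleaner.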