🤖 AI Summary
This work systematically investigates when safety interventions should be introduced during language-model pretraining, motivated by the vulnerability of pretrained models to adversarial attacks and to degradation under downstream fine-tuning. Holding the pretraining data and interventions fixed, the study varies only the intervention onset, introducing safety curricula after 0%, 20%, or 60% of the token budget, and uses linear probing of internal representations alongside evaluations of downstream fine-tuning robustness and safety-aware reasoning. The results show that the best onset depends on the goal: a delayed onset (20%-60%) yields the strongest robustness, with the clearest gains after benign downstream fine-tuning, while interventions from the very start improve steerability toward safer generations. Earlier interventions also strengthen the model's internal capacity to discriminate between safe and harmful content while maintaining a low over-rejection rate.
📝 Abstract
Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study a fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying only the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%-60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign fine-tuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability toward safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs. harmful examples. Our results are the first to establish intervention timing as a key curriculum design choice for safety.
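The linear-probing analysis mentioned above can be illustrated with a minimal sketch: a logistic-regression probe trained on frozen hidden states to separate safe from harmful examples. The code below is not the paper's implementation; it uses synthetic feature vectors as stand-ins for real model activations, and all dimensions and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical linear probe: logistic regression on frozen "activations".
# Synthetic stand-ins for hidden states; two classes offset along a
# random direction so the probe has a linear signal to find.
rng = np.random.default_rng(0)
d = 16    # assumed hidden-state dimension
n = 200   # examples per class

direction = rng.normal(size=d)
safe = rng.normal(size=(n, d)) - direction     # label 0
harmful = rng.normal(size=(n, d)) + direction  # label 1
X = np.vstack([safe, harmful])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Train the probe with plain gradient descent on the logistic loss;
# the base model's parameters would stay frozen in a real setup.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (pred == y).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

In the paper's setting, the probe's accuracy (or how cleanly a linear boundary separates the two classes) serves as a measure of how well the pretrained representations encode the safe/harmful distinction; the finding is that earlier interventions make this separation cleaner.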