🤖 AI Summary
Post-hoc alignment is fragile in high-stakes applications because large language models (LLMs) acquire harmful behavioral patterns during pretraining, and these are hard to remove after the fact. This paper proposes the first data-driven framework for *intrinsic safety acquisition* at the pretraining stage. The method introduces: (1) a harm-label injection mechanism that embeds fine-grained safety signals directly in the training data; (2) the largest synthetic safety dataset to date (100B tokens), combining *RefuseWeb*-style refusal dialogues with web-style moral-education content; and (3) an evaluation suite that measures safety at the base-model level, before instruction tuning. Experiments show the approach reduces adversarial attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks, substantially enhancing the native safety of foundation models without compromising utility.
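The harm-label injection idea can be illustrated with a minimal sketch: spans that a safety classifier flags as unsafe are wrapped in a special control token before pretraining, so the model learns to associate the tag with harmful text and decoding can later be steered away from tagged continuations. All names here (`HARM_TAG_OPEN`, `score_harm`, the toy heuristic) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of harm-label injection during pretraining data
# preparation. Token names and the scoring function are placeholders.

HARM_TAG_OPEN = "<|harm|>"
HARM_TAG_CLOSE = "<|/harm|>"

def score_harm(passage: str) -> float:
    """Stand-in for a learned safety classifier; returns a harm score in [0, 1]."""
    # Toy heuristic for demonstration only.
    return 1.0 if "attack recipe" in passage else 0.0

def inject_harm_tags(passages, threshold=0.5):
    """Wrap passages scoring above `threshold` in harm-control tokens."""
    tagged = []
    for p in passages:
        if score_harm(p) >= threshold:
            tagged.append(f"{HARM_TAG_OPEN}{p}{HARM_TAG_CLOSE}")
        else:
            tagged.append(p)
    return tagged

docs = ["how photosynthesis works", "detailed attack recipe for ..."]
print(inject_harm_tags(docs))
```

Because the tag appears in the pretraining stream itself, the safety signal is learned jointly with language modeling rather than patched on afterward.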
📝 Abstract
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4-labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date (100B tokens), generated via recontextualization of harmful web data; (iii) RefuseWeb and Moral Education datasets that convert harmful prompts into refusal dialogues and web-style educational material; (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content and steer inference away from harmful generations; and (v) safety evaluations measuring base model behavior before instruction tuning. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks.
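The classifier-driven pipeline in (i)-(iii) can be sketched as a routing step: each document is scored, safe documents are kept verbatim, and unsafe ones are recontextualized (for example, rewritten into a refusal-style dialogue) rather than dropped outright. The function names, the threshold, and the toy scoring rule below are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch of safety-aware data routing for pretraining.
# A real system would use a trained classifier and an LLM rewriter.

def classify_safety(doc: str) -> float:
    """Stand-in for a classifier distilled from ~10k labeled examples."""
    return 0.9 if "harmful" in doc else 0.1

def route_document(doc: str, threshold: float = 0.5):
    """Return (bucket, text): keep safe docs, recontextualize unsafe ones."""
    if classify_safety(doc) < threshold:
        return ("keep", doc)
    # Recontextualize: turn harmful content into a refusal-style dialogue,
    # so the model still sees the pattern but paired with a safe response.
    refusal = (f"User: {doc}\n"
               "Assistant: I can't help with that. Here is why it is unsafe ...")
    return ("recontextualized", refusal)

corpus = ["a recipe for bread", "harmful instructions ..."]
print([route_document(d) for d in corpus])
```

Rewriting rather than deleting unsafe data is the key design choice: the model retains exposure to harmful patterns, but only in contexts that model refusal or explain the harm.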