🤖 AI Summary
Post-hoc alignment is fragile in high-stakes applications because large language models (LLMs) acquire harmful behavioral patterns during pretraining, and these are hard to remove after the fact. This paper proposes the first data-driven framework for *intrinsic safety acquisition* at the pretraining stage. The method introduces: (1) a harm-label injection mechanism that embeds fine-grained safety signals directly in the training data; (2) the largest synthetic safety dataset to date (100B tokens), combining *RefuseWeb*-style refusal dialogues with web-style moral-education content; and (3) an evaluation suite that measures safety at the base-model level, before instruction tuning. Experiments show the approach reduces adversarial attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks, substantially enhancing the native safety of foundation models without compromising utility.
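The harm-label injection idea can be illustrated with a minimal sketch: spans that a safety classifier flags as unsafe are wrapped in a special control token before pretraining, so the model learns to associate the tag with harmful text and decoding can later be steered away from tagged continuations. All names here (`HARM_TAG_OPEN`, `score_harm`, the toy heuristic) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of harm-label injection during pretraining data
# preparation. Token names and the scoring function are placeholders.

HARM_TAG_OPEN = "<|harm|>"
HARM_TAG_CLOSE = "<|/harm|>"

def score_harm(passage: str) -> float:
    """Stand-in for a learned safety classifier; returns a harm score in [0, 1]."""
    # Toy heuristic for demonstration only.
    return 1.0 if "attack recipe" in passage else 0.0

def inject_harm_tags(passages, threshold=0.5):
    """Wrap passages scoring above `threshold` in harm-control tokens."""
    tagged = []
    for p in passages:
        if score_harm(p) >= threshold:
            tagged.append(f"{HARM_TAG_OPEN}{p}{HARM_TAG_CLOSE}")
        else:
            tagged.append(p)
    return tagged

docs = ["how photosynthesis works", "detailed attack recipe for ..."]
print(inject_harm_tags(docs))
```

Because the tag appears in the pretraining stream itself, the safety signal is learned jointly with language modeling rather than patched on afterward.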
📝 Abstract
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4-labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date (100B tokens), generated via recontextualization of harmful web data; (iii) RefuseWeb and Moral Education datasets that convert harmful prompts into refusal dialogues and web-style educational material; (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content and steer inference away from harmful generations; and (v) safety evaluations measuring base model behavior before instruction tuning. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks.
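The classifier-driven pipeline in (i)-(iii) can be sketched as a routing step: each document is scored, safe documents are kept verbatim, and unsafe ones are recontextualized (for example, rewritten into a refusal-style dialogue) rather than dropped outright. The function names, the threshold, and the toy scoring rule below are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch of safety-aware data routing for pretraining.
# A real system would use a trained classifier and an LLM rewriter.

def classify_safety(doc: str) -> float:
    """Stand-in for a classifier distilled from ~10k labeled examples."""
    return 0.9 if "harmful" in doc else 0.1

def route_document(doc: str, threshold: float = 0.5):
    """Return (bucket, text): keep safe docs, recontextualize unsafe ones."""
    if classify_safety(doc) < threshold:
        return ("keep", doc)
    # Recontextualize: turn harmful content into a refusal-style dialogue,
    # so the model still sees the pattern but paired with a safe response.
    refusal = (f"User: {doc}\n"
               "Assistant: I can't help with that. Here is why it is unsafe ...")
    return ("recontextualized", refusal)

corpus = ["a recipe for bread", "harmful instructions ..."]
print([route_document(d) for d in corpus])
```

Rewriting rather than deleting unsafe data is the key design choice: the model retains exposure to harmful patterns, but only in contexts that model refusal or explain the harm.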