From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models such as those behind ChatGPT have traditionally been trained with a binary refusal boundary: the model either fully complies with a prompt or refuses it outright based on inferred user intent. This makes them brittle on ambiguous or dual-use queries (e.g., in biosecurity or cybersecurity), producing either over-refusal or unsafe compliance. This work proposes an output-centric safety alignment paradigm that shifts the question from *whether* to respond to *how* to respond safely. By combining output-directed safety fine-tuning with real-world production comparisons and internally controlled experiments, the method optimizes generation behavior under explicit policy constraints, improving robustness on ambiguous and dual-use requests while preserving response utility. Evaluation on GPT-5 shows reduced severity of residual safety failures, substantially improved helpfulness on legitimate queries, and a better overall safety–helpfulness trade-off.

📝 Abstract
Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
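The contrast the abstract draws between a binary refusal boundary and output-centric safe-completions can be sketched as a toy reward comparison. This is purely an illustrative sketch under assumed scoring rules, not the paper's actual training objective; the function names and the assumption that helpfulness and harm severity are scored in [0, 1] are hypothetical.

```python
# Hypothetical sketch contrasting the two training objectives described above.
# None of these names or scoring rules come from the paper itself.

def refusal_reward(intent_malicious: bool, refused: bool) -> float:
    """Binary refusal boundary: reward depends only on classified user intent.
    Refusing a malicious prompt or complying with a benign one scores 1."""
    return 1.0 if refused == intent_malicious else 0.0

def safe_completion_reward(helpfulness: float, severity: float) -> float:
    """Output-centric: maximize helpfulness within the safety policy.
    `helpfulness` and `severity` are assumed to lie in [0, 1];
    severity > 0 means the output itself violates the policy."""
    if severity == 0.0:
        return helpfulness   # safe output: rewarded for being useful
    return -severity         # unsafe output: penalty scales with severity

# A high-level dual-use answer (helpful, no actionable harmful detail)
# is rewarded, while a detailed harmful answer is penalized by severity.
print(safe_completion_reward(helpfulness=0.8, severity=0.0))  # 0.8
print(safe_completion_reward(helpfulness=0.9, severity=0.6))  # -0.6
```

Note how the refusal objective ignores the output entirely, which is exactly the brittleness the abstract points to for prompts with obscured intent: the same safe, high-level completion can be rewarded or punished depending only on the intent label.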
Problem

Research questions and friction points this paper is trying to address.

Refusal-based safety training lacks flexibility for prompts with ambiguous or obscured intent
Binary refusal boundaries fail in dual-use domains such as biology and cybersecurity
Need for a training approach that keeps outputs safe while maximizing helpfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safe-completions center training on the safety of the output rather than classifying user intent
Maximizes helpfulness within the safety policy's constraints
Reduces the severity of residual safety failures