🤖 AI Summary
To address the challenge of large language models (LLMs) failing to reliably adhere to explicit safety rules in high-stakes scenarios, this paper proposes Deliberative Alignment, a reflective alignment paradigm. The method encodes safety specifications structurally and introduces a "specification-driven explicit reasoning" mechanism that requires the model to proactively retrieve and reason over the relevant policies before generating an answer, without requiring human-annotated chains of thought or answer labels. By combining instruction tuning with policy-guided reasoning triggers and supervised regularization of reasoning paths, the approach promotes faithful policy execution. Applied to OpenAI's o-series models, it achieves substantial improvements: 32% higher jailbreak resistance, a 41% reduction in over-refusal rate, and a 27% gain in out-of-distribution (OOD) generalization accuracy, demonstrating simultaneous gains in robustness and practical utility.
📝 Abstract
As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies without requiring human-written chains of thought or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and it also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
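The "recall the policy, reason over it, then answer" pattern the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: in Deliberative Alignment the model itself recalls and reasons over the safety specification inside its chain of thought, whereas here policy retrieval is reduced to a toy keyword match, and the policy snippets, function names, and labels are all hypothetical.

```python
# Hedged sketch of specification-driven explicit reasoning.
# SAFETY_SPEC, recall_policy, and deliberate_and_answer are illustrative
# stand-ins, not OpenAI's actual safety policies or training pipeline.

SAFETY_SPEC = {
    "self-harm": "Respond supportively with resources; never provide methods.",
    "weapons": "Refuse requests for instructions to build weapons.",
    "default": "Answer helpfully when no policy is implicated.",
}

def recall_policy(prompt: str) -> tuple[str, str]:
    """Retrieve the policy section most relevant to the prompt (toy keyword match)."""
    for topic, rule in SAFETY_SPEC.items():
        if topic != "default" and topic.replace("-", " ") in prompt.lower():
            return topic, rule
    return "default", SAFETY_SPEC["default"]

def deliberate_and_answer(prompt: str) -> dict:
    """Emit an explicit reasoning trace over the recalled policy, then decide."""
    topic, rule = recall_policy(prompt)
    reasoning = f"Relevant policy [{topic}]: {rule}"
    # The decision is conditioned on the recalled rule, not on the prompt alone.
    answer = "ANSWER" if topic == "default" else "REFUSE_OR_SAFE_COMPLETE"
    return {"reasoning": reasoning, "answer": answer}
```

The point of the sketch is the ordering: the reasoning trace over the recalled specification is produced before the answer is chosen, which is what distinguishes this paradigm from training the refusal behavior in directly without an explicit policy step.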