Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key challenge in safety alignment for open-source large language models: mitigating harmful outputs without excessively rejecting harmless requests. To this end, the authors propose CADA, a novel approach that replaces conventional rule-based safety instructions with a case-augmented reasoning mechanism. By integrating legal-style precedents with concise rules, CADA enables the model to autonomously generate safety-aware reasoning chains in complex scenarios. The method further employs reinforcement learning to optimize a deliberative alignment process. Experimental results demonstrate that CADA significantly enhances model harmlessness and robustness against adversarial attacks while substantially reducing over-refusal rates. Moreover, it maintains high utility across multiple benchmarks, offering an efficient, generalizable, and adaptive safety alignment framework for open-source large language models.

📝 Abstract
Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduced deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed "code-like" safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.
Problem

Research questions and friction points this paper is trying to address.

LLM safety
over-refusal
deliberative alignment
harmlessness
helpfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

case-augmented reasoning
deliberative alignment
reinforcement learning
safety alignment
over-refusal reduction