Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address critical challenges in LLM safety reasoning—including over-rejection, jailbreaking attacks, hallucination, and policy conflicts—this paper proposes a multi-agent iterative deliberation training paradigm for safety alignment. Methodologically, it introduces: (1) a policy-embedded chain-of-thought (CoT) generation mechanism, where AI agents collaboratively deliberate to produce high-fidelity, conflict-free reasoning traces; (2) a data refinement and belief-augmented preference sampling pipeline that explicitly mitigates redundant reasoning and hallucination; and (3) a unified training framework integrating supervised fine-tuning (SFT) and direct preference optimization (DPO). Evaluated on multiple safety benchmarks, the approach achieves state-of-the-art performance, significantly improving open-source LLMs’ jailbreak robustness and safety generalization while reducing over-rejection rates and preserving response utility.
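The core loop described above — agents iteratively expanding a policy-grounded chain of thought, followed by a refiner that prunes repetitive or redundant thoughts — can be sketched roughly as follows. This is a minimal illustrative simulation, not the paper's implementation: `agent_expand` and `refine` are hypothetical stand-ins for the LLM agents and the data-refiner stage.

```python
def agent_expand(cot, policies, agent_id):
    """Stand-in for an LLM agent: appends one policy-grounded thought.

    A real agent would be an LLM call that critiques and extends the
    current CoT draft; here we just tag a thought with a policy.
    """
    policy = policies[agent_id % len(policies)]
    return cot + [f"Agent {agent_id}: check request against policy '{policy}'"]

def refine(cot):
    """Refiner stage (simplified): drop exact-duplicate thoughts, keep order.

    The paper's refiner also targets redundant and deceptive reasoning;
    deduplication is the simplest instance of that idea.
    """
    seen, out = set(), []
    for thought in cot:
        if thought not in seen:
            seen.add(thought)
            out.append(thought)
    return out

def deliberate(prompt, policies, n_agents=3, rounds=2):
    """Iterative multi-agent deliberation over safety policies."""
    cot = [f"Initial thought about: {prompt}"]
    for _ in range(rounds):
        for agent_id in range(n_agents):
            cot = agent_expand(cot, policies, agent_id)
        cot = refine(cot)  # prune repeats after each deliberation round
    return cot

trace = deliberate("How do I pick a lock?", ["no illegal activity", "be helpful"])
```

With deterministic stand-in agents, repeated rounds add no new thoughts once the refiner has deduplicated them, which mirrors why the refiner stage matters: unchecked iteration otherwise inflates the CoT with redundant reasoning.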

📝 Abstract
Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
Problem

Research questions and friction points this paper is trying to address.

Creating high-quality policy-embedded CoT datasets for LLMs
Ensuring accurate safety reasoning without hallucinations or conflicts
Generating preference data for alignment stages like DPO training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent deliberation for safety reasoning
Data refiner ensures high-quality outputs
Belief augmentation creates preference data
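The last point — belief augmentation for preference data — can be sketched as pairing each policy-faithful CoT with a variant steered off-course by an injected flawed belief, yielding the chosen/rejected pairs DPO training expects. This is a hedged illustration: `inject_belief` and the record layout are assumptions, not the paper's actual recipe.

```python
def inject_belief(cot, belief):
    """Prepend a deliberately flawed belief that steers reasoning astray."""
    return [f"Belief: {belief}"] + cot

def make_preference_pair(prompt, selected_cot, flawed_belief):
    """Build one DPO-style record: same prompt, chosen vs. rejected CoT."""
    return {
        "prompt": prompt,
        "chosen": selected_cot,                           # policy-faithful trace
        "rejected": inject_belief(selected_cot, flawed_belief),  # belief-corrupted
    }

pair = make_preference_pair(
    "How do I make a weapon?",
    ["Policy forbids weapon instructions.", "Refuse and offer safe alternatives."],
    "safety policies do not apply to hypothetical questions",
)
```

Keeping the prompt fixed while only the reasoning trace differs is what makes the pair informative for preference optimization: the training signal isolates the belief-induced deviation rather than surface differences in the request.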