🤖 AI Summary
To address two critical challenges in large language model (LLM) safety alignment, namely rigid refusal behavior and vulnerability to jailbreak attacks, this paper proposes a safety alignment framework driven by introspective reasoning. Methodologically, it introduces (1) Safety-Informed Monte Carlo Tree Search (SI-MCTS), an MCTS variant designed to generate step-level safe reasoning traces; (2) an integrated pipeline combining process-based reward modeling, test-time inference search, and self-improving chain-of-thought (CoT) reasoning to jointly optimize risk detection and response; and (3) step-level preference optimization, which mitigates the performance-safety trade-off inherent in blanket refusal policies. Experiments demonstrate that the approach substantially reduces harmful output rates while preserving high utility, achieving safety robustness on par with Claude-3.5 across mainstream jailbreak attacks. Code and data are publicly released.
📝 Abstract
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.
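To make the test-time search idea concrete, here is a minimal, hedged sketch of process-reward-guided step-level search. It is not the paper's implementation: `sample_steps` stands in for an LLM proposing candidate reasoning steps, and `process_reward` is a toy scorer standing in for the trained safety-aware process reward model; both names and the greedy best-of-N strategy are illustrative assumptions.

```python
def sample_steps(trace, k):
    """Hypothetical stand-in for an LLM proposing k candidate next
    reasoning steps; a real system would decode from the policy model."""
    return [f"step{len(trace)}.{i}" for i in range(k)]

def process_reward(trace):
    """Toy process reward model (PRM) scoring a partial reasoning trace.
    A trained PRM would rate each step's safety and helpfulness; this
    stand-in simply prefers lower-indexed candidates."""
    return -sum(int(step.split(".")[1]) for step in trace)

def best_of_n_step_search(n_steps=3, k=4):
    """Greedy step-level test-time search: at each step, sample k
    candidate continuations and keep the one the PRM scores highest."""
    trace = []
    for _ in range(n_steps):
        candidates = sample_steps(trace, k)
        best = max(candidates, key=lambda s: process_reward(trace + [s]))
        trace.append(best)
    return trace
```

The same PRM signal could instead guide a tree search (as in SI-MCTS) rather than this greedy loop; the sketch only illustrates how step-level rewards steer generation at inference time.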