What Matters For Safety Alignment?

📅 2026-01-07

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

This study systematically investigates the key factors influencing safety alignment in large language models (LLMs) and reasoning language models (LRMs), along with their vulnerabilities under adversarial attacks. Through a large-scale empirical analysis encompassing 32 mainstream models (3B–235B parameters), 13 model families, five safety benchmarks, and 60 attack strategies, the work reveals that integrating reasoning and self-reflection mechanisms significantly enhances safety alignment. It further demonstrates that post-training and knowledge distillation can systematically degrade model safety. Notably, prefix-based chain-of-thought attacks are shown to increase jailbreak success rates by 3.34×, exposing critical risks in text-completion interfaces. The study identifies GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the safest models, underscoring safety as a core optimization objective in post-training.

Technology Category

Application Category

📝 Abstract

This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.

Problem

Research questions and friction points this paper is trying to address.

safety alignment

large language models

reasoning-enhanced models

adversarial attacks

model vulnerabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

safety alignment

reasoning and self-reflection

chain-of-thought (CoT) attack