🤖 AI Summary
Large reasoning models (LRMs) exhibit "superficial safety alignment" (SSA), a failure mode in which outputs appear safe while the internal reasoning fails to genuinely identify or mitigate the underlying risks, which is especially concerning in safety-critical applications.
Method: We formally define and quantify SSA via multi-sampling consistency analysis; introduce BSA, a benchmark of 2,000 fine-grained instances spanning nine risk categories and three SSA scenario types, each with expert-annotated risk rationales; and evaluate mitigation strategies including safety-rule injection, fine-tuning on safety reasoning data, and diverse decoding.
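To make the multi-sampling consistency idea concrete, here is a minimal sketch (not the paper's official harness): the caller supplies `generate`, `is_safe`, and `covers_rationale` callables for their own model and judge; these three names and the default of 8 samples are assumptions of this example, not part of the BSA release.

```python
from typing import Callable

def ssa_consistency(
    prompt: str,
    annotated_risks: list[str],
    generate: Callable[[str, int], str],
    is_safe: Callable[[str], bool],
    covers_rationale: Callable[[str, list[str]], bool],
    k: int = 8,
) -> dict[str, float]:
    """Sample k responses and compare surface safety with risk-rationale coverage."""
    surface_safe = 0
    rationale_hits = 0
    for seed in range(k):
        response = generate(prompt, seed)   # caller-supplied model call
        surface_safe += is_safe(response)   # response-level safety judge
        rationale_hits += covers_rationale(response, annotated_risks)
    return {
        "surface_safe_rate": surface_safe / k,
        "rationale_accuracy": rationale_hits / k,
        # A large positive gap (safe-looking answers, missing rationales) is
        # symptomatic of SSA.
        "ssa_gap": (surface_safe - rationale_hits) / k,
    }
```

In this framing, a model with high `surface_safe_rate` but low `rationale_accuracy` produces safe answers without genuinely reasoning about the annotated risks.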
Contribution/Results: Empirical evaluation of 19 state-of-the-art LRMs reveals a top accuracy of only 38.0% in identifying risk rationales, exposing severe deficiencies in reasoning-level safety robustness. BSA constitutes the first systematic benchmark and methodology for rigorously assessing and advancing deep safety alignment in LRMs.
📝 Abstract
Despite the remarkable proficiency of *Large Reasoning Models* (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as ***Superficial Safety Alignment* (SSA)**: a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce the **Beyond Safe Answers (BSA)** benchmark, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
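As a rough illustration of two of the mitigations examined (safety rules and diverse decoding), the sketch below prepends an explicit safety rule to the prompt and samples across several temperatures; `chat(messages, temperature)` stands in for any chat-completion API, and both the rule text and the temperature values are assumptions of this example rather than the paper's settings.

```python
from typing import Callable

# Assumed rule text for illustration; the paper's actual safety rules may differ.
SAFETY_RULES = (
    "Before answering, explicitly list any safety risks present in the request "
    "and explain how your answer avoids or mitigates each one."
)

def sample_with_safety_rules(
    user_prompt: str,
    chat: Callable[[list[dict], float], str],
    temperatures: tuple[float, ...] = (0.2, 0.7, 1.0),
) -> list[str]:
    """Prepend a safety-rule system message, then sample across several temperatures."""
    messages = [
        {"role": "system", "content": SAFETY_RULES},
        {"role": "user", "content": user_prompt},
    ]
    return [chat(messages, t) for t in temperatures]
```

The returned samples can then be fed to the same consistency check sketched above to see whether rule injection or decoding diversity narrows the gap between surface safety and risk-rationale coverage.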