AI Summary
This work addresses a limitation of current safety evaluations of language models, which score only final outputs and miss harmful content embedded in reasoning traces. The authors propose a multi-principle safety framework, built on a twenty-principle rubric, that jointly assesses reasoning traces and final answers. It identifies two high-severity failure modes, "leak" (unsafe reasoning followed by a safe-looking answer) and "escape" (benign-looking reasoning followed by an unsafe answer), and shows that risk concentrates in specific safety principles. To mitigate these issues, they introduce a white-box, test-time adaptive steering method that learns one unsafe-to-safe activation direction per principle and activates a direction only when the current hidden state lies closer to that principle's unsafe centroid than to its safe centroid, which keeps the intervention fine-grained. Evaluated on three open reasoning models, the approach reduces unsafe counts in both reasoning traces and final answers; on DeepSeek-R1-Qwen-7B it achieves a 40.8% average unsafe-count reduction while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU.
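The two stage-wise failure modes can be made concrete with a small labeling sketch. It assumes each trajectory has already been judged as safe or unsafe at the reasoning stage and at the answer stage; the function name `classify_trajectory` and the labels `both_unsafe` and `safe` are illustrative choices, not names taken from the paper.

```python
from collections import Counter


def classify_trajectory(reasoning_unsafe: bool, answer_unsafe: bool) -> str:
    """Label one reasoning-answer trajectory from two per-stage judgments.

    Hypothetical helper: the per-stage booleans are assumed to come from
    an external safety judge applied to each stage separately.
    """
    if reasoning_unsafe and not answer_unsafe:
        return "leak"          # unsafe reasoning, safe-looking answer
    if answer_unsafe and not reasoning_unsafe:
        return "escape"        # benign-looking reasoning, unsafe answer
    if reasoning_unsafe and answer_unsafe:
        return "both_unsafe"   # illustrative label for the remaining unsafe case
    return "safe"


# Toy judgments as (reasoning_unsafe, answer_unsafe) pairs; invented for illustration.
judgments = [(True, False), (False, True), (False, False), (True, True)]
counts = Counter(classify_trajectory(r, a) for r, a in judgments)
```

A final-answer-only evaluation would flag the `escape` and `both_unsafe` cases here but would pass the `leak` case, which is the gap the joint evaluation is meant to close.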
Abstract
Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources, we evaluate 15 open-weight and API-based LRMs across 41K prompts per model. Reasoning traces consistently reveal additional safety risks beyond final answers, especially in high-severity stage-wise failures: leak cases, where unsafe reasoning precedes a safe-looking answer, and escape cases, where benign-looking reasoning precedes an unsafe final response. Principle-level analysis shows that risk concentrates in misinformation, legal compliance, discrimination, physical harm, and psychological harm. We further propose adaptive multi-principle steering, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates only those directions for which the current hidden state is closer to the unsafe centroid than to the safe centroid. On three steerable open reasoning models, adaptive steering reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. DeepSeek-R1-Qwen-7B achieves a 40.8% average unsafe-count reduction while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU. These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.
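The gating rule behind adaptive multi-principle steering can be illustrated with a minimal sketch. Several details below are assumptions rather than the authors' implementation: the direction is taken as the difference between the safe and unsafe centroids, distance is Euclidean, active directions are simply added with a scale `alpha`, and the names `fit_principle` and `adaptive_steer` are hypothetical. Hidden states are plain Python lists standing in for activations at one layer.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_principle(unsafe_states, safe_states):
    """Compute centroids and an unsafe-to-safe direction for one principle.

    Assumption: the direction is the difference of the two centroids.
    """
    unsafe_c = mean_vector(unsafe_states)
    safe_c = mean_vector(safe_states)
    direction = [s - u for s, u in zip(safe_c, unsafe_c)]
    return {"unsafe": unsafe_c, "safe": safe_c, "direction": direction}


def adaptive_steer(hidden, principles, alpha=1.0):
    """Apply only the directions whose unsafe centroid is nearer than the safe one."""
    steered = list(hidden)
    active = []
    for name, p in principles.items():
        # Gate on the original hidden state, not the partially steered one.
        if distance(hidden, p["unsafe"]) < distance(hidden, p["safe"]):
            active.append(name)
            steered = [h + alpha * d for h, d in zip(steered, p["direction"])]
    return steered, active


# Toy 2-D example with invented activations for two principles.
principles = {
    "misinformation": fit_principle([[1, 0], [3, 0]], [[-2, 0], [-2, 0]]),
    "physical_harm": fit_principle([[0, 5], [0, 5]], [[0, -5], [0, -5]]),
}
steered, active = adaptive_steer([1.0, -1.0], principles, alpha=0.5)
```

In this toy case only the `misinformation` direction fires, because the hidden state sits nearer that principle's unsafe centroid, while the `physical_harm` direction is left inactive. Gating per principle in this way is what would keep the intervention from perturbing activations on prompts where no principle is at risk.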