The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

📅 2025-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses emerging security risks posed by large reasoning models (e.g., DeepSeek-R1, OpenAI-o3). Methodologically, we establish a comprehensive evaluation framework integrating compliance benchmarks (SafeBench, AdvBench), adversarial attacks (jailbreaking, prompt injection), and multi-dimensional robustness analysis. Our key findings are threefold: (1) We first demonstrate that the chain-of-thought (CoT) reasoning traces of R1-class models exhibit higher inherent safety risks than their final outputs; (2) We empirically confirm a positive correlation between enhanced reasoning capability and latent harmfulness; and (3) We reveal that distilled reasoning models exhibit significantly weaker safety alignment compared to their aligned base counterparts. The study identifies four critical safety gaps, providing empirical grounding and actionable pathways for improving safety alignment and governance of reasoning models.

Technology Category

Application Category

📝 Abstract

The rapid development of large reasoning models, such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models~(LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source R1 models and the o3-mini model, on both safety benchmark and attack, suggesting more safety effort on R1 is needed. (2) The distilled reasoning model shows poorer safety performance compared to its safety-aligned base models. (3) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models pose greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.

Problem

Research questions and friction points this paper is trying to address.

Assess safety risks in reasoning models

Evaluate compliance with safety regulations

Investigate vulnerability to adversarial attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive safety assessment using benchmarks

Investigation of adversarial attack susceptibility

Analysis of reasoning model safety gaps

🔎 Similar Papers

No similar papers found.

Authors to Follow