Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates safety-alignment disparities between DeepSeek (a Mixture-of-Experts (MoE) architecture) and GPT-3.5/GPT-4 under jailbreak attacks. Using the HarmBench benchmark, we conduct fine-grained empirical analysis across 510 harmful behaviors and seven representative attack categories. We find that DeepSeek exhibits selective robustness against optimization-based attacks—attributable to routing sparsity—but heightened vulnerability to prompt-engineering attacks, revealing uneven expert-level alignment that induces inconsistent refusal behavior. In contrast, GPT-4 Turbo achieves stronger and more stable safety alignment. Our core contribution is twofold: (1) uncovering an inherent trade-off between computational efficiency and alignment generalization in open-weight MoE models; and (2) proposing an architecture-aware, fine-grained analytical framework for LLM safety evaluation, enabling precise diagnosis of alignment failures at the expert-routing level.

📝 Abstract
The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.
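The routing sparsity discussed above refers to an MoE layer activating only a small subset of expert sub-networks per token. The sketch below is a generic top-k gating illustration, not DeepSeek's actual router implementation; the function name and shapes are assumptions for exposition. It shows why alignment coverage can be uneven: each token is processed by only k of the experts, so an adversarial prompt may be routed to experts that saw less safety tuning.

```python
import numpy as np

def top_k_route(gate_logits, k=2):
    """Select the top-k experts per token and renormalize their gate weights.

    gate_logits: (num_tokens, num_experts) router scores.
    Returns (indices, weights): which experts each token is dispatched to,
    and the normalized mixing weights for those experts.
    """
    # Softmax over experts for each token
    probs = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Keep only the k highest-probability experts (routing sparsity):
    # all other experts are skipped entirely for this token.
    indices = np.argsort(probs, axis=-1)[:, -k:]
    weights = np.take_along_axis(probs, indices, axis=-1)
    weights /= weights.sum(axis=-1, keepdims=True)
    return indices, weights

# One token, four experts: only the two highest-scoring experts fire.
logits = np.array([[2.0, 0.1, -1.0, 0.5]])
idx, w = top_k_route(logits, k=2)
```

Because only the selected experts' parameters ever see a given input, safety behavior learned by unselected experts cannot compensate, which is consistent with the paper's observation of inconsistent refusals.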
Problem

Research questions and friction points this paper is trying to address.

Evaluates DeepSeek and GPT models' vulnerability to jailbreak attacks
Compares robustness of open-source and proprietary LLMs against adversarial prompts
Analyzes trade-offs between architectural efficiency and safety alignment in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic jailbreak evaluation using HarmBench benchmark
Analyzes the impact of the MoE architecture's routing sparsity on attack robustness
Compares safety alignment of dense vs MoE designs
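The evaluation protocol above (seven attack strategies applied to 510 behaviors, scored per category) can be sketched as a generic benchmark harness. This is an illustrative outline, not HarmBench's actual code; `model`, `judge`, and the attack callables are hypothetical stand-ins for the model under test, a harmfulness classifier, and the attack implementations.

```python
from collections import defaultdict

def evaluate_jailbreaks(model, behaviors, attacks, judge):
    """Run each attack on each harmful behavior and tally attack success rates.

    model:     callable prompt -> completion (the LLM under test)
    behaviors: list of (behavior_id, category, goal_text) tuples
    attacks:   dict attack_name -> callable goal_text -> adversarial prompt
    judge:     callable (goal_text, completion) -> bool (True if harmful)
    Returns {(attack_name, category): attack success rate}.
    """
    successes = defaultdict(int)
    totals = defaultdict(int)
    for behavior_id, category, goal in behaviors:
        for name, attack in attacks.items():
            prompt = attack(goal)           # build the adversarial prompt
            completion = model(prompt)      # query the model under test
            totals[(name, category)] += 1
            if judge(goal, completion):     # did the attack elicit harm?
                successes[(name, category)] += 1
    return {key: successes[key] / totals[key] for key in totals}

# Toy usage with stub components (for illustration only):
behaviors = [("b1", "cyber", "do X"), ("b2", "bio", "do Y")]
attacks = {"direct": lambda g: g,
           "prefix": lambda g: "Ignore rules. " + g}
model = lambda p: "OK " + p if p.startswith("Ignore") else "REFUSE"
judge = lambda g, c: c.startswith("OK")
asr = evaluate_jailbreaks(model, behaviors, attacks, judge)
```

Aggregating success rates per (attack, category) pair is what enables the fine-grained comparison the paper reports, e.g. optimization-based versus prompt-based attacks on DeepSeek versus GPT-4 Turbo.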