The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models

📅 2025-02-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This study reveals a safety-alignment vulnerability of chain-of-thought (CoT) enhanced large language models (e.g., DeepSeek) under adversarial fine-tuning attacks. Prior work overlooks the risks arising from interactions between CoT mechanisms and adversarial inputs; we are the first to systematically demonstrate that malicious instruction fine-tuning can deliberately steer reasoning chains toward high-harm outputs, and that CoT amplifies, rather than mitigates, attack efficacy. Methodologically, we construct an adversarial fine-tuning dataset and combine behavioral trajectory analysis, quantitative harmfulness evaluation, and reasoning-path attribution to make the attack process interpretable. Experiments show that after the attack, DeepSeek's harmful response rate increases by 3.2×, and its safety-alignment robustness degrades significantly. Our core contribution is the "CoT–adversarial input coupling failure" paradigm, a conceptual framework that advances theoretical understanding and provides empirical grounding for defending CoT-augmented models against fine-tuning-based alignment subversion.
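To make the attack pipeline the summary describes concrete, the sketch below shows a generic LoRA supervised fine-tuning setup of the kind such attacks rely on. This is a minimal sketch under stated assumptions, not the paper's actual code: the model name, dataset fields, and hyperparameters are illustrative, and the adversarial instruction/response pairs themselves are deliberately omitted.

```python
# Hypothetical sketch of an adversarial fine-tuning setup (not the paper's code).
# Model name, hyperparameters, and dataset layout are assumptions for illustration;
# the harmful instruction/response pairs the attack uses are deliberately omitted.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed; any CoT model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# LoRA keeps the attack cheap: only small adapter matrices are trained, yet
# even such lightweight fine-tuning can erode safety alignment.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

def tokenize(example):
    # Concatenate instruction and response for a standard causal-LM objective.
    text = example["instruction"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=512)

# Placeholder rows; in the attack these would be adversarial pairs.
train = Dataset.from_list([{"instruction": "...", "response": "..."}])
train = train.map(tokenize, remove_columns=["instruction", "response"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=train,
    # Pads batches and derives labels from input_ids (mlm=False => causal LM).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The point the paper makes is that nothing exotic is required here: a standard, low-cost fine-tuning recipe applied to adversarial data is enough to subvert alignment, and the CoT mechanism amplifies the effect.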

📝 Abstract
Large language models are typically trained on vast amounts of data during the pre-training phase, which may include potentially harmful information. Fine-tuning attacks can exploit this by prompting the model to reveal such behaviours, leading to the generation of harmful content. In this paper, we investigate the performance of the Chain-of-Thought-based reasoning model DeepSeek when subjected to fine-tuning attacks. Specifically, we explore how fine-tuning manipulates the model's output, exacerbating the harmfulness of its responses, while examining the interaction between Chain-of-Thought reasoning and adversarial inputs. Through this study, we aim to shed light on the vulnerability of Chain-of-Thought-enabled models to fine-tuning attacks and the implications for their safe and ethical deployment.
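To make the harmfulness comparison concrete, here is a minimal sketch of how a harmful-response-rate evaluation could be run before and after fine-tuning. The data structures and the judge are assumptions (the paper does not publish its evaluation code); the 3.2× figure quoted in the summary above corresponds to the amplification ratio computed here.

```python
# Minimal sketch of a harmful-response-rate evaluation (assumed, not the
# paper's released code). A "judge" labels each model response as harmful
# or not; the attack's effect is the ratio of the two rates.
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    prompt: str
    response: str
    harmful: bool  # verdict from a safety judge (human or classifier)


def harmful_rate(responses: list[JudgedResponse]) -> float:
    """Fraction of responses the judge labeled harmful."""
    if not responses:
        return 0.0
    return sum(r.harmful for r in responses) / len(responses)


def amplification(baseline: list[JudgedResponse],
                  attacked: list[JudgedResponse]) -> float:
    """How many times more often the attacked model answers harmfully."""
    base = harmful_rate(baseline)
    return harmful_rate(attacked) / base if base > 0 else float("inf")


# Example: harmful_rate(baseline) = 0.10 and harmful_rate(attacked) = 0.32
# would give amplification(...) == 3.2, the increase reported in the summary.
```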
Problem

Research questions and friction points this paper is trying to address.

DeepSeek Model
Adversarial Attacks
Ethical Usage

Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepSeek Model
Adversarial Attacks
Security Enhancement