Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the safety alignment vulnerability of reasoning-enhanced vision-language models (RVLMs). We propose a lightweight, stealthy adversarial attack method that exploits exposed chain-of-thought (CoT) traces. Our approach introduces a segment-level perturbation mechanism and a self-generated reasoning trajectory reuse strategy, coupled with a turn-weighted loss to subtly disrupt alignment mechanisms under minimal supervision—without degrading the original reasoning distribution. Efficient fine-tuning is achieved via QLoRA, requiring only 499 samples, a single A100 GPU, and under three hours. On multiple benchmarks including AdvBench, our method achieves a 38.52% higher attack success rate than IDEATOR, while preserving the model’s general multimodal reasoning capability. To the best of our knowledge, this is the first RVLM alignment attack demonstrating low-resource efficiency, high stealthiness, and strong generalization—marking a significant advance in adversarial alignment research for reasoning-augmented multimodal models.

📝 Abstract
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data; a turn-based weighted loss design then yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.
Problem

Research questions and friction points this paper is trying to address.

Breaking safety alignment in reasoning-augmented vision-language models
Developing stealth fine-tuning method using self-generated reasoning traces
Creating low-cost attack to bypass alignment defenses efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-level interference elicits harmful reasoning traces
Self-generated outputs reused as fine-tuning data
Turn-based weighted loss enables lightweight distribution-consistent tuning
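The turn-based weighted loss above can be illustrated with a minimal sketch in plain Python: per-token losses are grouped by dialogue turn and each turn contributes according to a scalar weight. Note that `turn_weights` and the averaging scheme here are assumptions for illustration; the paper's actual weighting design is not specified in this summary.

```python
def turn_weighted_loss(turn_token_losses, turn_weights):
    """Compute a turn-weighted average of per-token training losses.

    turn_token_losses: list of lists; token-level NLL values for each
        dialogue turn in a multi-turn training sample.
    turn_weights: one scalar weight per turn (hypothetical scheme, e.g.
        up-weighting later turns that carry the reused CoT trace).
    Returns the weighted mean loss over all tokens.
    """
    assert len(turn_token_losses) == len(turn_weights)
    weighted_sum = 0.0
    weight_total = 0.0
    for losses, weight in zip(turn_token_losses, turn_weights):
        for token_loss in losses:
            weighted_sum += weight * token_loss
            weight_total += weight
    return weighted_sum / weight_total


# Example: two turns, the second weighted twice as heavily.
loss = turn_weighted_loss([[1.0, 1.0], [2.0]], [1.0, 2.0])
```

In a real fine-tuning loop this scalar weighting would typically be applied to the per-token cross-entropy tensor before reduction, so that gradient contributions from different turns are rescaled rather than masked out.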
Le Yu
Machine Intelligence Laboratory, Sichuan University
Zhengyue Zhao
University of Wisconsin–Madison
Yawen Zheng
Department of Automation, Tsinghua University
Yunhao Liu
ACM Fellow, IEEE Fellow, CCF Fellow, Tsinghua University
Wireless Sensor Networks/RFID, Cyber-Physical Systems and IoT, Privacy and Security, Cloud Computing