Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the safety alignment vulnerability of reasoning-enhanced vision-language models (RVLMs). We propose a lightweight, stealthy adversarial attack method that exploits exposed chain-of-thought (CoT) traces. Our approach introduces a segment-level perturbation mechanism and a self-generated reasoning trajectory reuse strategy, coupled with a turn-weighted loss to subtly disrupt alignment mechanisms under minimal supervision—without degrading the original reasoning distribution. Efficient fine-tuning is achieved via QLoRA, requiring only 499 samples, a single A100 GPU, and under three hours. On multiple benchmarks including AdvBench, our method achieves a 38.52% higher attack success rate than IDEATOR, while preserving the model’s general multimodal reasoning capability. To the best of our knowledge, this is the first RVLM alignment attack demonstrating low-resource efficiency, high stealthiness, and strong generalization—marking a significant advance in adversarial alignment research for reasoning-augmented multimodal models.

📝 Abstract
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data; a turn-based weighted loss design then yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.
Problem

Research questions and friction points this paper is trying to address.

Breaking safety alignment in reasoning-augmented vision-language models
Developing stealth fine-tuning method using self-generated reasoning traces
Creating low-cost attack to bypass alignment defenses efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-level interference elicits harmful reasoning traces
Self-generated outputs reused as fine-tuning data
Turn-based weighted loss enables lightweight distribution-consistent tuning
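The turn-based weighted loss above can be illustrated with a minimal sketch in plain Python: per-token losses are grouped by dialogue turn and each turn contributes according to a scalar weight. Note that `turn_weights` and the averaging scheme here are assumptions for illustration; the paper's actual weighting design is not specified in this summary.

```python
def turn_weighted_loss(turn_token_losses, turn_weights):
    """Compute a turn-weighted average of per-token training losses.

    turn_token_losses: list of lists; token-level NLL values for each
        dialogue turn in a multi-turn training sample.
    turn_weights: one scalar weight per turn (hypothetical scheme, e.g.
        up-weighting later turns that carry the reused CoT trace).
    Returns the weighted mean loss over all tokens.
    """
    assert len(turn_token_losses) == len(turn_weights)
    weighted_sum = 0.0
    weight_total = 0.0
    for losses, weight in zip(turn_token_losses, turn_weights):
        for token_loss in losses:
            weighted_sum += weight * token_loss
            weight_total += weight
    return weighted_sum / weight_total


# Example: two turns, the second weighted twice as heavily.
loss = turn_weighted_loss([[1.0, 1.0], [2.0]], [1.0, 2.0])
```

In a real fine-tuning loop this scalar weighting would typically be applied to the per-token cross-entropy tensor before reduction, so that gradient contributions from different turns are rescaled rather than masked out.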
Le Yu
Machine Intelligence Laboratory, Sichuan University
Zhengyue Zhao
University of Wisconsin–Madison
Yawen Zheng
Department of Automation, Tsinghua University
Yunhao Liu
ACM Fellow, IEEE Fellow, CCF Fellow, Tsinghua University
Wireless Sensor Networks/RFID, Cyber-Physical Systems and IoT, Privacy and Security, Cloud Computing