🤖 AI Summary
This work addresses a safety-alignment vulnerability of reasoning-enhanced vision-language models (RVLMs). We propose a lightweight, stealthy adversarial attack that exploits exposed chain-of-thought (CoT) traces. Our approach introduces a segment-level perturbation mechanism and a self-generated reasoning-trajectory reuse strategy, coupled with a turn-weighted loss that subtly disrupts alignment under minimal supervision while preserving the original reasoning distribution. Efficient fine-tuning is achieved via QLoRA, requiring only 499 samples, a single A100 GPU, and under three hours. On multiple benchmarks including AdvBench, our method achieves a 38.52% higher attack success rate than IDEATOR while preserving the model's general multimodal reasoning capability. To the best of our knowledge, this is the first RVLM alignment attack to demonstrate low-resource efficiency, high stealthiness, and strong generalization, marking a notable advance in adversarial alignment research for reasoning-augmented multimodal models.
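The turn-weighted loss is only named above, not specified. As a rough illustration, the sketch below shows one plausible way to weight a causal-LM loss by dialogue turn; the function name, the `turn_ids`/`turn_weights` tensors, and the weighting scheme are assumptions for illustration, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def turn_weighted_loss(logits, labels, turn_ids, turn_weights):
    """Token-level cross-entropy, scaled by a per-turn weight.

    logits:       (batch, seq, vocab)  model outputs
    labels:       (batch, seq)         target ids, -100 = ignored token
    turn_ids:     (batch, seq)         index of the dialogue turn per token
    turn_weights: (num_turns,)         hypothetical per-turn weight schedule
    """
    # Standard next-token shift for causal language modeling.
    logits, labels, turn_ids = logits[:, :-1], labels[:, 1:], turn_ids[:, 1:]

    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.shape)

    # Weight each token's loss by the weight of the turn it belongs to,
    # masking out ignored positions before averaging.
    mask = (labels != -100).float()
    weights = turn_weights[turn_ids.clamp(min=0)] * mask
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```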
📝 Abstract
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken by a novel attack method termed **Stealth Fine-Tuning**. Our method elicits harmful reasoning traces through **segment-level interference** and reuses the self-generated outputs as supervised fine-tuning data. A **turn-based weighted** loss design then yields a lightweight, distribution-consistent fine-tuning method. With only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. **Disclaimer: This paper contains content that may be disturbing or offensive.**
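As a sense check on the reported compute budget (QLoRA, one A100, under 3 hours, 499 samples), a minimal QLoRA setup with Hugging Face `transformers` and `peft` might look like the following; the model id and every hyperparameter are illustrative assumptions, since the abstract gives none.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "org/rvlm-checkpoint",  # placeholder id; the paper's base model is not named here
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; rank/alpha are guesses.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

With only a few hundred samples and adapters of this size, a sub-3-hour run on a single A100 is plausible, consistent with the paper's low-resource claim.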