Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

📅 2026-02-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limited generalization and poor cross-model transferability of existing gradient-based image jailbreak attacks on vision-language models. To overcome these limitations, the authors propose UltraBreak, a framework that improves attack universality and transferability by applying spatial transformations and regularization constraints in the visual domain while optimizing a semantics-guided objective in the language model's textual embedding space. UltraBreak is the first method to achieve effective universal attacks across diverse jailbreak targets and black-box vision-language models, substantially mitigating overfitting to the surrogate model. Extensive experiments show that UltraBreak consistently outperforms state-of-the-art approaches across multiple models and attack scenarios, highlighting the critical role of semantic objectives in smoothing the loss landscape and improving transferability.

πŸ“ Abstract
Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \href{https://github.com/kaiyuanCui/UltraBreak}{GitHub repository}.
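The abstract's recipe (constrain an adversarial pattern with vision-space transformations and regularisation, and steer it with a semantic loss defined over target embeddings) can be illustrated with a toy numerical sketch. Everything below is an assumption for illustration: the linear matrix `W` stands in for a frozen vision encoder, circular shifts stand in for the paper's spatial transformations, and cosine distance to fixed target embeddings stands in for the semantics-guided textual objective. None of these are the authors' actual components; see the repository for the real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, e = 16, 8                            # toy image and embedding dimensions
W = rng.normal(size=(e, d))             # stand-in for a frozen encoder
images = rng.normal(size=(4, d))        # clean inputs (flattened "images")
targets = rng.normal(size=(3, e))       # toy semantic target embeddings
eps, lam, lr = 0.5, 1e-3, 0.05          # L-inf budget, regulariser, step size
shifts = [0, 1, 2, 3]                   # toy stand-in for spatial transforms

def loss_and_grad(delta, x, t, s):
    """Loss (1 - cosine to target) + L2 penalty, with an analytic gradient."""
    z = x + np.roll(delta, s)           # apply transformed perturbation
    u = W @ z                           # embed the perturbed input
    nu, nt = np.linalg.norm(u), np.linalg.norm(t)
    cos = u @ t / (nu * nt)
    # gradient of (1 - cos) w.r.t. u, chained back through W and the shift
    g_u = -(t / (nu * nt) - cos * u / nu**2)
    g_delta = np.roll(W.T @ g_u, -s) + 2 * lam * delta
    return (1 - cos) + lam * delta @ delta, g_delta

def mean_loss(delta):
    """Average loss over all images, targets, and shifts."""
    return np.mean([loss_and_grad(delta, x, t, s)[0]
                    for x in images for t in targets for s in shifts])

delta = np.zeros(d)                     # one universal perturbation
initial = mean_loss(delta)
for _ in range(200):
    g = np.zeros(d)
    for x in images:
        for t in targets:
            s = shifts[int(rng.integers(len(shifts)))]  # random transform
            g += loss_and_grad(delta, x, t, s)[1]
    g /= len(images) * len(targets)
    delta = np.clip(delta - lr * g, -eps, eps)          # project to budget
final = mean_loss(delta)
```

Averaging the gradient over images, targets, and random transformations is what pushes `delta` toward a single pattern that works across inputs and objectives, rather than one overfit to a specific input, mirroring the universality argument in the abstract.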
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
vision-language models
transferability
adversarial patterns
model generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

universal adversarial attack
transferable jailbreak
vision-language models
semantic-based objective
multimodal security