Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing gradient-based universal image jailbreaking methods transfer poorly across vision-language models (VLMs) and struggle to achieve effective cross-model, untargeted attacks. This work finds that model refusal behavior is concentrated in high-entropy tokens during autoregressive decoding and attributes the limited transferability to overly constrained optimization objectives. To address this, the authors propose a lightweight jailbreaking approach that maximizes output entropy only at these critical decision positions to flip refusal outcomes, while using KL divergence regularization to stabilize low-entropy positions and preserve textual coherence. The method achieves competitive white-box attack success rates across three VLMs and two safety benchmarks, improves cross-model transferability, and remains robust under representative defense mechanisms.

📝 Abstract

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model, without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before the attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
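
The shape of the objective described in the abstract can be illustrated with a toy, model-free sketch. It is not the authors' implementation. It assumes per-position next-token logits are already available as plain lists, and every name in it (`split_positions`, `ujem_kl_style_loss`, the entropy threshold `tau`, the weight `lam`) is hypothetical. The threshold rule for choosing decision positions and the KL direction (clean distribution relative to the perturbed one) are also assumptions, since the abstract does not specify them.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); q is clamped from below to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def split_positions(clean_logits, tau):
    # Assumption: a position counts as a "decision" position when the
    # entropy of the clean (unattacked) distribution is at least tau.
    decision, stable = [], []
    for t, logits in enumerate(clean_logits):
        if entropy(softmax(logits)) >= tau:
            decision.append(t)
        else:
            stable.append(t)
    return decision, stable

def ujem_kl_style_loss(clean_logits, adv_logits, tau=0.5, lam=1.0):
    # Loss to minimize: negative mean entropy at decision positions
    # plus lam times the mean KL(clean || perturbed) at the remaining
    # low-entropy positions.
    decision, stable = split_positions(clean_logits, tau)
    ent = 0.0
    if decision:
        ent = sum(entropy(softmax(adv_logits[t])) for t in decision) / len(decision)
    kl = 0.0
    if stable:
        kl = sum(
            kl_divergence(softmax(clean_logits[t]), softmax(adv_logits[t]))
            for t in stable
        ) / len(stable)
    return -ent + lam * kl

# Toy logits over a 3-token vocabulary at two decoding positions.
clean = [[5.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # position 0 is confident, position 1 is not
flatter = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]] # position 1 pushed to uniform
print(split_positions(clean, tau=0.5))
print(ujem_kl_style_loss(clean, clean), ujem_kl_style_loss(clean, flatter))
```

In this toy setup, flattening the distribution at the uncertain position lowers the loss while the confident position is left unchanged, which mirrors the split between decision tokens and stabilized tokens that the abstract describes. Because no fixed prefix or response pattern appears in the objective, the sketch is also consistent with the untargeted threat model.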

Problem

Research questions and friction points this paper is trying to address.

multimodal jailbreak

transferability

vision-language models

untargeted attack

safety alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy maximization

untargeted jailbreak

vision-language models