Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing optimization-based jailbreaks for Large Vision-Language Models (LVLMs) rely on continuing already-toxic text (Toxic-Continuation) and therefore fail when the textual input is purely benign. This work introduces the Benign-to-Toxic (B2T) paradigm: harmful outputs are induced solely by optimizing adversarial images under entirely benign textual prompts, achieving, for the first time, implicit jailbreaking of the form “benign text + adversarial image → toxic response.” The approach combines gradient-based image optimization, multimodal joint safety evaluation, and a black-box transferability testing framework. Experiments show that B2T significantly outperforms prior methods in both white-box and black-box settings, and that it combines with text-based jailbreaks to further amplify attack efficacy. These results expose a previously underappreciated vulnerability in multimodal alignment: visual-modality pathways remain susceptible to adversarial manipulation even when the linguistic input is strictly harmless.
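
To make the pipeline concrete, below is a minimal sketch of the white-box image-optimization step. It assumes a HuggingFace-style LVLM (e.g., a LLaVA-like model) whose forward pass accepts pixel_values, input_ids, and labels and returns a next-token cross-entropy loss; the function names, hyperparameters, and interface are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the B2T white-box optimization step. Assumptions (not
# the authors' code): a HuggingFace-style LVLM whose forward pass accepts
# pixel_values, input_ids, and labels and returns a next-token
# cross-entropy loss, plus a matching processor/tokenizer pair.
import torch

def b2t_attack(model, processor, image, benign_prompt, toxic_target,
               eps=16 / 255, alpha=1 / 255, steps=500):
    """PGD-style loop: perturb the image so that a benign prompt yields a
    high likelihood for a toxic target response."""
    inputs = processor(text=benign_prompt, images=image, return_tensors="pt")
    pixel_values = inputs["pixel_values"]

    # Append the toxic target after the benign prompt, but supervise only
    # the target positions: the prompt itself carries no toxic signal.
    target_ids = processor.tokenizer(
        toxic_target, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    input_ids = torch.cat([inputs["input_ids"], target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : inputs["input_ids"].shape[1]] = -100  # ignore prompt tokens

    delta = torch.zeros_like(pixel_values, requires_grad=True)
    for _ in range(steps):
        loss = model(pixel_values=pixel_values + delta,
                     input_ids=input_ids, labels=labels).loss
        loss.backward()
        with torch.no_grad():
            # Signed-gradient descent on the perturbation, projected back
            # into the epsilon ball so the change stays visually small.
            # (Clamping to the processor's valid pixel range is omitted.)
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (pixel_values + delta).detach()
```

Because the loop minimizes the cross-entropy of the toxic response conditioned on a benign prompt, all of the jailbreak signal must be carried by the image perturbation, which is what separates B2T from Toxic-Continuation.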

📝 Abstract
Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
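The key distinction is where the toxic signal enters the objective: Toxic-Continuation places it in the conditioning text, while B2T places it only in the supervision target. A minimal sketch of the two label constructions, assuming a HuggingFace-style tokenizer (all names are illustrative, not the paper's code):

```python
# Contrast of the two supervision schemes; tokenizer calls follow
# HuggingFace conventions and all names are illustrative assumptions.
import torch

def toxic_continuation_labels(tokenizer, toxic_prompt):
    # Toxic-Continuation: the conditioning text is itself toxic; the image
    # is optimized so the model keeps predicting its next tokens.
    ids = tokenizer(toxic_prompt, return_tensors="pt").input_ids
    return ids, ids.clone()  # every position is supervised

def b2t_labels(tokenizer, benign_prompt, toxic_response):
    # Benign-to-Toxic: the conditioning text is benign; toxicity appears
    # only in the supervision target, so the image alone must break the
    # model's safety alignment.
    prompt_ids = tokenizer(benign_prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(toxic_response, add_special_tokens=False,
                             return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, response_ids], dim=1)
    labels = ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # no loss on the benign prompt
    return ids, labels
```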
Problem

Research questions and friction points this paper is trying to address.

Inducing toxic outputs from benign prompts in LVLMs
Overcoming limitations of Toxic-Continuation jailbreak methods
Exposing vulnerabilities in multimodal safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes adversarial images to elicit toxic outputs
Benign-to-Toxic paradigm in which the image alone breaks safety mechanisms
Transfers effectively in black-box settings
👥 Authors
Hee-Seon Kim, KAIST
Minbeom Kim, Ph.D. student, Seoul National University (Artificial Intelligence, AI Safety, AI Control, Agent Safety)
Wonjun Lee, Korea Advanced Institute of Science and Technology (KAIST)
Kihyun Kim, Korea Advanced Institute of Science and Technology (KAIST)
Changick Kim, Korea Advanced Institute of Science and Technology (Computer vision)