Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy

📅 2025-03-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work exposes a fundamental vulnerability of safety-aligned large language models (LLMs) and multimodal LLMs (MLLMs) under out-of-distribution (OOD) harmful inputs: current alignment methods fail to generalize to OOD adversarial samples. Method: The authors propose JOOD, a gradient-free, model-agnostic jailbreaking framework that applies lightweight cross-modal OOD perturbations (e.g., image mixup, text style transfer, OCR noise injection, and semantic confusion) to disrupt the model's ability to discern malicious intent and thereby bypass its safety guardrails. Contribution/Results: Leveraging uncertainty-driven attack strategies, JOOD achieves >85% attack success rates against strongly aligned commercial models (e.g., GPT-4, o1), significantly outperforming prior approaches. This is the first systematic study of the failure mechanisms of safety alignment under OOD conditions, empirically demonstrating its limited generalization. The findings provide critical insights for robust safety alignment and establish a new benchmark for evaluating alignment robustness against distributional shifts.
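To make the perturbation idea concrete, here is a minimal, hypothetical sketch of what one of the listed transformations, OCR-style noise injection, could look like: characters are swapped for visually similar ones so the surface form of the prompt drifts away from the text the alignment data covered. The confusion table, noise rate, and the function name inject_ocr_noise are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical illustration of OCR-style noise injection: substitute characters
# with visually similar ones, as an OCR engine might misread them.
# The substitution table and noise rate below are assumptions for illustration.
OCR_CONFUSIONS = {"o": "0", "O": "0", "l": "1", "I": "1", "e": "3", "a": "@", "s": "5"}

def inject_ocr_noise(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Randomly replace a fraction of confusable characters in `text`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

# Example (benign placeholder string):
# print(inject_ocr_noise("please describe the process"))
```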

πŸ“ Abstract
Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to generalize across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety alignment via preference-tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses with regard to harmful inputs. However, despite the significance of safety alignment, research on the vulnerabilities remains largely underexplored. In this paper, we investigate the unexplored vulnerability of the safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution (OOD)-ifying harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying the vanilla harmful inputs highly increases the uncertainty of the model to discern the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying the harmful inputs. Notably, we observe that even simple mixing-based techniques such as image mixup prove highly effective in increasing the uncertainty of the model, thereby facilitating the bypass of the safety alignment. Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and o1 with high attack success rate, which previous attack approaches have consistently struggled to jailbreak. Code is available at https://github.com/naver-ai/JOOD.
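The abstract highlights image mixup as a simple but effective OOD-ifying transformation. Below is a minimal sketch of such a pixel-level mixup under the usual convex-combination formulation; the function name mixup_images, the auxiliary image, and the mixing ratios are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from PIL import Image

def mixup_images(query_path: str, auxiliary_path: str, lam: float = 0.5) -> Image.Image:
    """Blend a query image with an unrelated auxiliary image.

    A convex combination of pixel values pushes the input away from either
    source distribution, which is the intuition behind mixup-based OOD-ifying.
    """
    img_a = Image.open(query_path).convert("RGB")
    img_b = Image.open(auxiliary_path).convert("RGB").resize(img_a.size)

    arr_a = np.asarray(img_a, dtype=np.float32)
    arr_b = np.asarray(img_b, dtype=np.float32)

    mixed = lam * arr_a + (1.0 - lam) * arr_b  # pixel-level convex combination
    return Image.fromarray(mixed.astype(np.uint8))

# Example: sweep the mixing ratio to vary how far the input drifts from the
# distribution the safety alignment was trained on (ratios are illustrative).
# for lam in (0.3, 0.5, 0.7):
#     mixup_images("query.png", "auxiliary.png", lam).save(f"mixed_{lam}.png")
```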
Problem

Research questions and friction points this paper is trying to address.

Investigates the vulnerability of safety-aligned LLMs and MLLMs to jailbreaking via out-of-distribution (OOD) harmful inputs
Proposes JOOD, a framework that bypasses safety alignment by OOD-ifying harmful inputs
Demonstrates high attack success rates against proprietary models such as GPT-4 and o1
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an out-of-distribution strategy for jailbreaking
Applies off-the-shelf visual and textual transformation techniques
Leverages simple mixing-based techniques, such as image mixup, effectively (a sketch follows this list)
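As referenced in the last item above, here is a hypothetical textual analogue of the mixing-based idea: interleaving the words of a query with words from an unrelated benign sentence to raise the model's uncertainty about the input. The interleaving scheme and the function name mixup_text are assumptions for illustration, not a technique confirmed by the paper.

```python
def mixup_text(query: str, auxiliary: str) -> str:
    """Interleave the words of two strings, a crude text-level analogue of mixup."""
    q_words, a_words = query.split(), auxiliary.split()
    mixed = []
    for i in range(max(len(q_words), len(a_words))):
        if i < len(q_words):
            mixed.append(q_words[i])
        if i < len(a_words):
            mixed.append(a_words[i])
    return " ".join(mixed)

# Example (benign placeholder strings):
# print(mixup_text("describe the landscape", "clouds drift slowly overhead"))
```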
🔎 Similar Papers
No similar papers found.
Joonhyun Jeong
KAIST, NAVER Cloud
AI
Seyun Bae
Korea Advanced Institute of Science and Technology (KAIST)
Yeonsung Jung
KAIST
Self-Improving Agents, Visual Reasoning, Reliable AI
Jaeryong Hwang
Republic of Korea Naval Academy
Eunho Yang
KAIST
Machine Learning, Statistics