Text is All You Need for Vision-Language Model Jailbreaking

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Text-DJ, a novel attack that exposes a critical security vulnerability in large vision-language models (LVLMs) when harmful text is distributed across images and ingested through the OCR channel. By decomposing a malicious query into semantically coherent yet superficially benign sub-queries, embedding them among numerous distractors, and arranging the fragments into a grid of images, Text-DJ bypasses conventional text-based safety filters via the OCR pipeline. The approach establishes an end-to-end multimodal jailbreaking framework and achieves high attack success rates across multiple mainstream LVLMs. The results demonstrate the fragility of current alignment and safety mechanisms against fragmented, multimodal inputs and highlight an urgent need for robust defenses that account for cross-modal interaction and compositional threats.

📝 Abstract
Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but individually more benign sub-queries. Second, we select a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue the attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distraction, so that the model's safety protocols fail to link the scattered sub-queries hidden among a large number of irrelevant queries. Overall, our findings expose a critical vulnerability: LVLMs' OCR capabilities are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses against fragmented multimodal inputs.
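
To make the three-stage pipeline concrete, below is a minimal sketch of how such a grid of text images could be assembled. This is not the authors' released code: the decomposition is hard-coded, random sampling stands in for the paper's "maximally irrelevant" distractor-selection criterion, and the tile size, grid width, and use of Pillow are all illustrative assumptions.

```python
# Minimal sketch of the three-stage Text-DJ pipeline described above.
# Function names, grid geometry, and the use of Pillow are illustrative
# assumptions, not the authors' implementation.
from PIL import Image, ImageDraw
import random

TILE_W, TILE_H = 256, 64  # assumed size of each rendered query tile


def render_text_tile(text: str) -> Image.Image:
    """Render one query as an image, so the LVLM must read it via OCR
    instead of receiving it as ordinary text input."""
    tile = Image.new("RGB", (TILE_W, TILE_H), "white")
    ImageDraw.Draw(tile).text((8, TILE_H // 2 - 6), text, fill="black")
    return tile


def build_grid(sub_queries: list[str], distractors: list[str],
               cols: int = 4) -> Image.Image:
    """Stage 3: interleave sub-queries and distractors into a single grid,
    placing the sub-queries in the most central cells, per the abstract's
    statement that they sit in the middle of the grid."""
    n = len(sub_queries) + len(distractors)
    rows = -(-n // cols)  # ceiling division
    # Sort cells by distance from the grid center so that zipping assigns
    # sub-queries to central positions and distractors to the periphery.
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    center_r, center_c = (rows - 1) / 2, (cols - 1) / 2
    cells.sort(key=lambda rc: abs(rc[0] - center_r) + abs(rc[1] - center_c))
    grid = Image.new("RGB", (cols * TILE_W, rows * TILE_H), "white")
    for (r, c), query in zip(cells, sub_queries + distractors):
        grid.paste(render_text_tile(query), (c * TILE_W, r * TILE_H))
    return grid


# Stage 1 (performed by an auxiliary model in practice; hard-coded here):
# semantically related but individually benign sub-queries of one request.
sub_queries = ["benign sub-query A", "benign sub-query B", "benign sub-query C"]

# Stage 2 (placeholder for the "maximally irrelevant" selection criterion):
# sample distractors from a pool of unrelated questions.
distractor_pool = [f"unrelated question #{i}" for i in range(20)]
distractors = random.sample(distractor_pool, k=9)

build_grid(sub_queries, distractors).save("text_dj_grid.png")
```

The resulting image would then be submitted to the target LVLM alongside an innocuous instruction; the 4-column width and 3-sub-query example here are arbitrary choices for illustration.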
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Jailbreaking
OCR
Safety Alignment
Adversarial Inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jailbreak Attack
Vision-Language Models
OCR Vulnerability
Multimodal Adversarial Input
Safety Alignment Bypass
👥 Authors
Yihang Chen (UCLA) · Large Language Models, Alignment
Zhao Xu (University of California, Los Angeles)
Youyuan Jiang (University of California, Los Angeles)
Tianle Zheng (University of California, Los Angeles)
Cho-Jui Hsieh (University of California, Los Angeles) · Machine Learning, Optimization