Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak

📅 2024-05-30
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing direct jailbreaking attacks against large language models (LLMs) suffer from low success rates and poor generalizability due to the robustness and complexity of LLMs. Method: This paper proposes a novel *indirect jailbreaking* paradigm that leverages multimodal large language models (MLLMs) as surrogates: (1) efficiently jailbreak an MLLM to extract adversarial embeddings; (2) select initial textual prompts via image–text semantic matching; and (3) transfer the learned adversarial embeddings into textual suffixes for attacking target LLMs. Contribution/Results: By exploiting the relative vulnerability of MLLMs and establishing the first cross-modal embedding-space transfer pathway, our method circumvents the difficulty of direct LLM jailbreaking while significantly improving attack success rate, cross-model transferability, and cross-task generalization. Experiments demonstrate consistent superiority over state-of-the-art approaches in both efficiency and effectiveness, offering a new perspective for LLM security evaluation.
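Step (3) above, converting a learned adversarial embedding into a textual suffix, can be sketched as a nearest-neighbor projection of each continuous embedding vector onto the model's token-embedding table. This is a minimal illustrative sketch, not the paper's actual conversion procedure; the function name and the assumption of cosine-similarity matching over NumPy arrays are hypothetical.

```python
import numpy as np

def embedding_to_suffix(adv_embeds, token_embeds, vocab):
    """Project each adversarial embedding vector onto the nearest
    vocabulary token embedding (by cosine similarity) and join the
    matched tokens into a textual jailbreaking suffix.

    adv_embeds:   (suffix_len, dim) learned continuous embeddings
    token_embeds: (vocab_size, dim) the model's token-embedding table
    vocab:        list of vocab_size token strings
    """
    # Normalize rows so the dot product equals cosine similarity.
    a = adv_embeds / np.linalg.norm(adv_embeds, axis=1, keepdims=True)
    t = token_embeds / np.linalg.norm(token_embeds, axis=1, keepdims=True)
    sims = a @ t.T                     # (suffix_len, vocab_size)
    ids = sims.argmax(axis=1)          # nearest token per suffix position
    return " ".join(vocab[i] for i in ids)
```

Because the projection is discrete, the resulting suffix only approximates the continuous embedding; the paper's pipeline presumably accounts for this gap when attacking the target LLM.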

๐Ÿ“ Abstract
This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that directly target LLMs, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. Subsequently, we perform an efficient MLLM jailbreak and obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to carry out the jailbreak of the target LLM. Compared to direct LLM-jailbreak methods, our indirect jailbreaking approach is more efficient, as MLLMs are more vulnerable to jailbreak than pure LLMs. Additionally, to improve the attack success rate of the jailbreak, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization abilities.
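The "jailbreaking embedding" mentioned in the abstract is obtained by optimizing a continuous input against the MLLM. The sketch below shows only the generic gradient-descent skeleton of such an optimization; the actual loss in the paper's setting (e.g., the likelihood of an affirmative MLLM response) is not reproduced here, so `loss_grad` is a hypothetical stand-in callable.

```python
import numpy as np

def jailbreak_embedding(init_embed, loss_grad, steps=100, lr=0.1):
    """Gradient-descent sketch: iteratively update a continuous
    'jailbreaking embedding' to minimize an attack loss.

    init_embed: (dim,) starting embedding (e.g., from an initial image)
    loss_grad:  callable returning the gradient of the attack loss
                with respect to the current embedding
    """
    e = init_embed.copy()
    for _ in range(steps):
        e -= lr * loss_grad(e)   # plain gradient step on the embedding
    return e
```

In practice this optimization runs through the MLLM's vision-language interface, which is reportedly easier to attack than the text-only pathway of the underlying LLM.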
Problem

Research questions and friction points this paper is trying to address.

Efficiently jailbreak LLMs via multimodal-LLM vulnerabilities
Convert MLLM jailbreak embeddings to textual suffixes
Improve attack success with image-text semantic matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Construct MLLM upon target LLM for jailbreak
Convert jailbreaking embedding to textual suffix
Use image-text matching to boost attack success
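The image-text semantic matching idea in the bullets above amounts to choosing, from a pool of candidate initial prompts, the one whose text embedding is closest to the embedding of the jailbreak goal or image. A minimal sketch, assuming embeddings have already been computed (e.g., by a CLIP-style encoder, which is an assumption here) and are available as NumPy arrays; the function and variable names are hypothetical.

```python
import numpy as np

def select_initial_prompt(goal_embed, prompt_embeds, prompts):
    """Return the candidate prompt whose precomputed text embedding has
    the highest cosine similarity with the goal/image embedding.

    goal_embed:    (dim,) embedding of the jailbreak goal or image
    prompt_embeds: (n_candidates, dim) embeddings of candidate prompts
    prompts:       list of n_candidates prompt strings
    """
    g = goal_embed / np.linalg.norm(goal_embed)
    p = prompt_embeds / np.linalg.norm(prompt_embeds, axis=1, keepdims=True)
    best = int((p @ g).argmax())       # most semantically similar candidate
    return prompts[best]
```

A semantically aligned starting point gives the subsequent embedding optimization a better initialization, which is how the paper reports improving the attack success rate.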
Zhenxing Niu
Xidian University
Yuyao Sun
Xidian University
Haoxuan Ji
Xi'an Jiaotong University
Gang Hua
Director of Applied Science, AI, Amazon.com, Inc., IEEE & IAPR Fellow
Computer Vision, Machine Learning, Artificial Intelligence, Robotics, Multimedia
Rong Jin
Meta
Haichang Gao
Xinbo Gao
Zheng Lin