JULI: Jailbreak Large Language Models by Self-Introspection

📅 2025-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the safety alignment of API-based black-box large language models (LLMs), proposing a lightweight jailbreaking method that requires no access to model weights or gradients. The method relies on introspective probability perturbation: it manipulates the token log-probabilities exposed by the API, needing only the top-5 token log-probabilities, to circumvent safety constraints. It introduces BiasNet, a small plug-in module that performs this manipulation, enabling alignment bypass under purely black-box conditions. Experiments across multiple mainstream closed-source LLMs indicate that the approach outperforms existing state-of-the-art methods: it reduces refusal rates by 23.6% and substantially increases malicious response rates. These results support both the effectiveness and the cross-model generalizability of the proposed technique.

📝 Abstract

Large Language Models (LLMs) are trained with safety alignment to prevent them from generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as requiring access to the model weights or the generation process. Since proprietary models accessed through API calls do not grant users such permissions, these attacks struggle to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities using a tiny plug-in block, BiasNet. JULI relies solely on knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs in a black-box setting, knowing only the top-$5$ token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.

Problem

Research questions and friction points this paper is trying to address.

Jailbreak LLMs without model weight access

Manipulate token log probabilities for attacks

Compromise API-calling LLMs in black-box settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses BiasNet to manipulate token log probabilities

Jailbreaks API-calling LLMs in black-box settings

Requires knowledge of only the top-5 token log probabilities
