Endless Jailbreaks with Bijection Learning

📅 2024-10-02

🏛️ International Conference on Learning Representations

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

Despite the multiple safety mechanisms deployed in large language models (LLMs), they remain vulnerable to adversarial jailbreaking attacks. This paper proposes bijection learning, an attack framework that uses in-context learning to teach the target model to encode and decode text under a bijective mapping, then fuzzes the model with randomized encodings of controllable complexity, bypassing safety alignment and recovering the unfiltered response by decoding it back into English. We reveal, for the first time, a positive correlation between model capability and the complexity of the most effective attack, showing that scaling can paradoxically increase vulnerability to bijection attacks and exposing a scale-dependent security flaw. Evaluated on multiple state-of-the-art LLMs, the attack achieves high jailbreak success rates, with more capable models jailbroken more severely. Remarkably, the minimal effective attack requires only a handful of key-value mappings.

📝 Abstract

Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm which automatically fuzzes LLMs for safety vulnerabilities using randomly-generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new vulnerabilities in frontier models can emerge with scale: more capable models are more severely jailbroken by bijection attacks.
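
To make the notion of a bijective encoding with controllable complexity concrete, here is a minimal sketch of the encoding alone, not of the attack pipeline. It assumes a letter-to-letter mapping over the lowercase alphabet in which the complexity parameter is the number of letters that are remapped. The names `make_bijection`, `encode`, `decode`, and `num_mapped` are hypothetical, and the paper's actual encodings and parameters may differ.

```python
import random
import string


def make_bijection(num_mapped, seed=None):
    """Build a bijection on a-z that remaps exactly `num_mapped` letters.

    Assumption for illustration: complexity is the count of remapped
    letters; every other letter maps to itself.
    """
    letters = list(string.ascii_lowercase)
    if num_mapped == 1 or not 0 <= num_mapped <= len(letters):
        raise ValueError("num_mapped must be 0 or between 2 and 26")
    rng = random.Random(seed)
    chosen = rng.sample(letters, num_mapped)
    # Send each chosen letter to the next one in the sampled order.
    # This is a single cycle, so no chosen letter maps to itself.
    targets = chosen[1:] + chosen[:1]
    mapping = {c: c for c in letters}
    mapping.update(zip(chosen, targets))
    return mapping


def encode(text, mapping):
    """Apply the mapping per character; non-letters pass through unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def decode(text, mapping):
    """Invert the mapping; this is well defined because it is a bijection."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


mapping = make_bijection(num_mapped=10, seed=0)
message = "hello world"
assert decode(encode(message, mapping), mapping) == message
```

Raising `num_mapped` moves the encoded text further from plain English, which is the kind of knob the abstract refers to when it mentions the number of key-value mappings.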

Problem

Research questions and friction points this paper is trying to address.

LLMs vulnerable to adversarial jailbreak attacks

Bijection learning bypasses safety mechanisms via encoded queries

More capable models more susceptible to complex bijection attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bijection learning for automatic LLM fuzzing

In-context learning with bijective encodings

Complexity control for effective jailbreak attacks
