Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreak attacks triggered by long-tailed distribution inputs—such as low-resource languages or encrypted data—in open-world settings, compromising their safety alignment. This work proposes EvoJail, a novel framework that introduces multi-objective evolutionary search to long-tailed jailbreak attacks for the first time. EvoJail models attack prompts via a dual-layer semantic-algorithmic representation and employs LLM-assisted evolutionary operators to jointly optimize attack effectiveness and output perplexity within a structured search space. The approach enables automated generation of diverse, high-efficacy attack strategies, revealing significant security vulnerabilities of LLMs under long-tailed conditions at both individual and ensemble levels. Empirical results demonstrate that EvoJail matches or surpasses the performance of existing methods in exposing these weaknesses.

📝 Abstract
Large Language Models (LLMs) have been widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs, including those from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic evaluation of these security and privacy vulnerabilities. In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search. EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and introduces a semantic-algorithmic solution representation to capture both high-level semantic intent and low-level structural transformations of encryption-decryption logic. Building upon this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive and semantically informed mutation and crossover for efficiently exploring a highly structured and open-ended search space. Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies, achieving performance competitive with existing methods at both the individual and ensemble levels.
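The abstract describes a multi-objective evolutionary loop that jointly maximizes attack effectiveness and minimizes output perplexity. The paper's actual operators are LLM-assisted and act on a semantic-algorithmic prompt representation, which is not reproduced here; the following is only a generic sketch of the Pareto-selection loop such a search implies, with toy `mutate` and `score` stand-ins (all function and field names are hypothetical, not from the paper).

```python
import random

def dominates(a, b):
    """a Pareto-dominates b: effectiveness no worse and perplexity no
    worse, with a strict improvement on at least one objective."""
    return (a["eff"] >= b["eff"] and a["ppl"] <= b["ppl"]
            and (a["eff"] > b["eff"] or a["ppl"] < b["ppl"]))

def pareto_front(pop):
    """Candidates not dominated by any other member of the population."""
    return [a for a in pop if not any(dominates(b, a) for b in pop if b is not a)]

def evolve(seeds, mutate, score, generations=20, pop_size=8, seed=0):
    """Toy multi-objective loop: mutate candidate prompts, score both
    objectives, and carry the Pareto front forward each generation.
    (Truncating the front to pop_size is a simplification; real
    NSGA-II-style selection would use crowding distance.)"""
    rng = random.Random(seed)
    pop = [{"prompt": p, **score(p)} for p in seeds]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            child = mutate(rng.choice(pop)["prompt"], rng)
            children.append({"prompt": child, **score(child)})
        pop = pareto_front(pop + children)[:pop_size]
    return pop

# Toy stand-ins: "effectiveness" counts 'x' characters, "perplexity"
# is just prompt length. In EvoJail both would come from querying LLMs.
def toy_score(p):
    return {"eff": p.count("x"), "ppl": len(p)}

def toy_mutate(p, rng):
    if rng.random() < 0.4 and p:
        return p[:-1]          # structural shrink
    return p + rng.choice("xabc")  # semantic-ish extension

front = evolve(["hello", "abc"], toy_mutate, toy_score)
```

Each surviving candidate represents a distinct effectiveness/perplexity trade-off, which is how a single run can yield a diverse ensemble of attack strategies rather than one optimum.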
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
long-tail distributions
safety alignment
security vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-objective evolutionary search
long-tail attacks
jailbreak prompts
semantic-algorithmic representation
LLM-assisted optimization