Involuntary Jailbreak

📅 2025-08-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a novel security vulnerability in large language models (LLMs)—*involuntary jailbreaking*: models systematically bypass safety guardrails and generate prohibited sensitive content or in-depth responses when prompted with a single, generic instruction—without adversarial optimization, model modification, or attacker-specific targeting. Unlike conventional *targeted* jailbreaking, this represents the first formalization of an *untargeted jailbreaking paradigm*, exposing fundamental structural weaknesses in current alignment mechanisms. Using lightweight prompt engineering, the authors consistently reproduce the phenomenon across state-of-the-art models—including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT-4.1—demonstrating its cross-model generality. Key contributions include: (1) formal definition of this new attack class; (2) development of a scalable, automated detection benchmark; and (3) empirical evidence underscoring critical gaps in robust alignment, thereby informing future safety research and mitigation strategies.

Technology Category

Application Category

📝 Abstract
In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term extbf{involuntary jailbreak}. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for extit{building a bomb}. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in future.
Problem

Research questions and friction points this paper is trying to address.

Exposing a new vulnerability called involuntary jailbreak in LLMs
Revealing the fragility of entire LLM guardrail structures
Demonstrating universal prompt bypasses safety in leading models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal prompt induces involuntary jailbreak
Generates rejected questions with detailed responses
Reveals fragility in entire guardrail structure
🔎 Similar Papers
No similar papers found.
Y
Yangyang Guo
National University of Singapore
Yangyan Li
Yangyan Li
Alibaba Group
Computer VisionComputer Graphics
M
Mohan Kankanhalli
National University of Singapore