ShallowJail: Steering Jailbreaks against Large Language Models

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Aligned large language models remain vulnerable to jailbreak attacks, yet existing approaches either rely on explicit prompts or incur high computational overhead, and so struggle to balance stealth and efficiency. This work proposes ShallowJail, which uncovers and exploits the fragility of shallow alignment mechanisms in large language models. By manipulating the initial tokens of model inputs during inference, ShallowJail enables low-overhead, highly stealthy white-box and gray-box jailbreak attacks without requiring complex optimization. Experiments across multiple state-of-the-art large language models demonstrate that ShallowJail significantly compromises their safety guarantees, supporting its effectiveness and generalizability.

📝 Abstract
Large Language Models (LLMs) have been successful in numerous fields. Alignment is usually applied to prevent them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted but non-stealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
Large Language Models
alignment
harmful outputs
model safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak attack
shallow alignment
large language models
token manipulation
model safety
Shang Liu
University of Louisville, KY, USA
Hanyu Pei
University of Louisville
deep learning
Zeyan Liu
University of Louisville, KY, USA