🤖 AI Summary
Existing jailbreaking attacks against large language models (LLMs) suffer from insufficient stealth and heavy reliance on manual prompt engineering. Method: This paper proposes PASS, a novel framework that jointly models semantic and syntactic formalization to enhance semantic equivalence and syntactic undetectability of jailbreaking prompts; integrates reinforcement learning for end-to-end automated prompt optimization; and incorporates GraphRAG to improve contextual awareness and multi-hop reasoning. Contribution/Results: Evaluated on multiple mainstream open-source LLMs, PASS achieves a 32.7% average improvement in jailbreaking success rate and bypasses leading alignment-based defenses with over 89% efficacy, demonstrating superior stealth. The results systematically expose critical weaknesses in current value-alignment mechanisms—particularly their lack of formal robustness and inadequate defense against dynamic contextual adversarial perturbations.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (**P**rompt J**a**ilbreaking via **S**emantic and **S**tructural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then organized into a GraphRAG system that supplies extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthening subsequent attacks and enabling more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
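The abstract's pipeline (RL-driven prompt formalization, then GraphRAG-based context accumulation feeding the next attack round) can be sketched at a high level. This is a minimal illustrative skeleton, not the paper's implementation: the `formalize` rewriter, the term extraction, and the `GraphRAG` class are all hypothetical stand-ins for the learned components the abstract describes.

```python
from collections import defaultdict

def formalize(prompt: str, step: int) -> str:
    """Stub for the RL-trained rewriter: turns a prompt into a more
    formal/symbolic description. A trivial string transform stands in
    for the learned policy here."""
    return f"Let Q_{step} denote the task: {prompt}"

class GraphRAG:
    """Minimal stand-in for the GraphRAG store: maps extracted terms
    to the formalized prompts in which they appeared."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, terms, formalized_prompt):
        for t in terms:
            self.edges[t].add(formalized_prompt)

    def retrieve(self, query_terms):
        # The paper's multi-hop reasoning is reduced to a one-hop
        # term lookup in this sketch.
        hits = set()
        for t in query_terms:
            hits |= self.edges.get(t, set())
        return sorted(hits)

def attack(initial_prompt: str, rounds: int = 3) -> str:
    """One hypothetical attack loop: formalize, store outputs in the
    graph, and retrieve accumulated context for the next round."""
    graph = GraphRAG()
    prompt = initial_prompt
    for step in range(rounds):
        formalized = formalize(prompt, step)
        # Crude term extraction as a placeholder for the real extractor.
        terms = [w for w in initial_prompt.split() if len(w) > 3]
        graph.add(terms, formalized)
        context = graph.retrieve(terms)
        # Next-round prompt = original query + retrieved formalized context.
        prompt = initial_prompt + " | context: " + "; ".join(context)
    return prompt
```

The key structural idea shown is the feedback loop: each round's formalized output enriches the graph, and retrieved context conditions the following attempt.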