A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language model (LLM) safety alignment mechanisms, particularly reinforcement learning from human feedback (RLHF) and content filters, to jailbreaking attacks with concealed intent. It proposes HILL, a novel jailbreak method that uses hypothetical reconstruction to transform harmful instructions into pedagogical question formats, exploiting models' helpfulness bias to circumvent alignment efficiently. Its key innovation is the first systematic embedding of jailbreaking intent within educational discourse, coupled with two newly designed evaluation metrics that expose the failure of existing safeguards under learning-style prompting. Experiments on AdvBench demonstrate HILL's state-of-the-art attack success rate across diverse LLMs and malicious-intent categories, with high efficiency and strong generalization. Crucially, most mainstream defenses not only fail to mitigate HILL but inadvertently amplify its efficacy, revealing critical weaknesses in current safety paradigms.

📝 Abstract
Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. In this paper, we introduce HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions with only straightforward hypotheticality indicators. Further, we introduce two new metrics to thoroughly evaluate the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Results of various defense methods show the robustness of HILL, with most defenses having mediocre effects or even increasing the attack success rates. Moreover, the assessment on our constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in defense methods. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting a critical challenge of balancing helpfulness and safety alignments.
Problem

Research questions and friction points this paper is trying to address.

Develops novel jailbreak method exploiting LLMs' helpfulness
Systematically transforms harmful requests into learning-style questions
Exposes vulnerabilities in safety mechanisms through learning-style elicitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hiding harmful requests as learning-style questions
Using straightforward hypotheticality indicators for transformation
Introducing new metrics for jailbreak utility evaluation
Xuan Luo
Harbin Institute of Technology, Shenzhen
Yue Wang
Shenzhen University
Zefeng He
Shenzhen University
Geng Tu
Harbin Institute of Technology (Shenzhen)
NLP · deep learning · text mining
Jing Li
Hong Kong Polytechnic University
Ruifeng Xu
Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing · Affective Computing · Argumentation Mining · LLMs · Bioinformatics