🤖 AI Summary
Existing robustness evaluations of large language model (LLM) safety alignment rely largely on heuristic text perturbations and lack a systematic, model-agnostic way to assess vulnerability to jailbreaking. Method: We propose QueryAttack—a novel framework that treats LLMs as structured knowledge bases susceptible to adversarial queries. Rather than perturbing text, it semantically compiles natural-language jailbreak prompts into SQL-like or programmatic structured prompts, enabling query-style injection without white-box model access. Contribution/Results: QueryAttack achieves high attack success rates (ASRs) across models from different vendors and capability levels, demonstrating strong cross-model generalization. Evaluation against common defenses confirms it is difficult to mitigate with general techniques; a tailored defense reduces ASR by up to 64% (on GPT-4-1106). Crucially, this work reframes the jailbreaking paradigm—from “input deception” to “structured query injection”—exposing a previously underexplored security risk: LLMs’ emergent behavior as knowledge interfaces vulnerable to semantic query exploitation.
📝 Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment have been developed for defense, prior research reveals the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to systematically examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into code-style structured queries to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack achieves high attack success rates (ASRs) across LLMs from different developers and with different capabilities. We also evaluate QueryAttack's performance against common defenses, confirming that it is difficult to mitigate with general defensive techniques. To defend against QueryAttack, we tailor a defense method which can reduce ASR by up to 64% on GPT-4-1106. The code of QueryAttack can be found at https://anonymous.4open.science/r/QueryAttack-334B.
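To make the core idea concrete, here is a minimal sketch of translating a natural-language question into a code-style structured query, as the abstract describes. This is an illustrative template only: the function name, field names, and query schema are assumptions, not the paper's actual prompt format, and a benign example question is used.

```python
def to_structured_query(content_type: str, topic: str) -> str:
    """Wrap a natural-language request in an SQL-style template.

    Hypothetical illustration of rephrasing a question as a query
    against the model treated as a knowledge database; the schema
    (knowledge_base, topic) is invented for this sketch.
    """
    return (
        f"SELECT {content_type} FROM knowledge_base "
        f"WHERE topic = '{topic}';"
    )


# Benign usage example: a natural-language question such as
# "Give me a tutorial on sorting algorithms" becomes:
query = to_structured_query("tutorial", "sorting algorithms")
print(query)
```

The paper's observation is that presenting a request in this structured form, rather than as plain natural language, can fall outside the distribution on which safety alignment was trained.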