🤖 AI Summary
Existing robustness evaluations of large language model (LLM) safety alignment rely largely on heuristic text perturbations and lack a systematic, model-agnostic way to assess vulnerability to jailbreaking. Method: We propose QueryAttack—a novel framework that treats LLMs as structured knowledge bases susceptible to adversarial queries. Rather than perturbing text, it semantically compiles natural-language jailbreak prompts into SQL-like or programmatic structured prompts, enabling query-style injection without white-box model access. Contribution/Results: QueryAttack achieves high attack success rates (ASRs) across models from different vendors and capability levels, demonstrating strong cross-model generalization. Evaluation against common defenses confirms it is difficult to mitigate with general techniques; a tailored defense reduces ASR by up to 64% (on GPT-4-1106). Crucially, this work reframes the jailbreaking paradigm—from “input deception” to “structured query injection”—exposing a previously underexplored security risk: LLMs’ emergent behavior as knowledge interfaces vulnerable to semantic query exploitation.
📝 Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment have been developed for defense, prior research reveals the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to systematically examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into code-style structured queries to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack achieves high attack success rates (ASRs) across LLMs from different developers and with different capabilities. We also evaluate QueryAttack's performance against common defenses, confirming that it is difficult to mitigate with general defensive techniques. To defend against QueryAttack, we tailor a defense method which can reduce ASR by up to 64% on GPT-4-1106. The code of QueryAttack can be found at https://anonymous.4open.science/r/QueryAttack-334B.
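To make the core idea concrete, here is a minimal sketch of translating a natural-language question into a code-style structured query, as the abstract describes. This is an illustrative template only: the function name, field names, and query schema are assumptions, not the paper's actual prompt format, and a benign example question is used.

```python
def to_structured_query(content_type: str, topic: str) -> str:
    """Wrap a natural-language request in an SQL-style template.

    Hypothetical illustration of rephrasing a question as a query
    against the model treated as a knowledge database; the schema
    (knowledge_base, topic) is invented for this sketch.
    """
    return (
        f"SELECT {content_type} FROM knowledge_base "
        f"WHERE topic = '{topic}';"
    )


# Benign usage example: a natural-language question such as
# "Give me a tutorial on sorting algorithms" becomes:
query = to_structured_query("tutorial", "sorting algorithms")
print(query)
```

The paper's observation is that presenting a request in this structured form, rather than as plain natural language, can fall outside the distribution on which safety alignment was trained.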