🤖 AI Summary
This work addresses the scalability risks posed by black-box jailbreaking attacks launched by non-expert users. We propose GTA, a game-theoretic adversarial framework that models LLM safety alignment as a finite-horizon stochastic game. Grounded in the "template-over-safety flip" hypothesis, GTA introduces an adaptive pressure-exerting attacker agent, integrated with quantal-response reparameterization, word-level insertion for detection evasion, multi-protocol decoding, and multilingual optimization. To our knowledge, this is the first systematic application of classical behavioral game theory to LLM jailbreaking, enabling interpretable, dynamically evolving attack strategies. Evaluated on mainstream models including Deepseek-R1, GTA achieves >95% attack success rates, effectively bypasses defenses such as PromptGuard, and demonstrates robust efficacy in real-world Hugging Face deployments. These results underscore the critical need for long-term, proactive security monitoring of deployed LLMs.
📝 Abstract
As LLMs become more widely deployed, even non-expert users can pose safety risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. In contrast to prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction with safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, its original safety preference may shift toward maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as a disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the attack success rate (ASR). Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1 while maintaining efficiency. Ablations over components, decoding strategies, multilingual settings, and the Agent's core model confirm both effectiveness and generalization, and scenario-scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the game mechanism fixed while varying the background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection rates under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications, and we report longitudinal safety monitoring of popular HuggingFace LLMs.
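For readers unfamiliar with the term, the quantal-response reparameterization mentioned in the abstract typically takes the standard logit form from behavioral game theory; the sketch below uses illustrative notation (payoff \(u\), rationality parameter \(\lambda\), action set \(\mathcal{A}\)) that is ours, not necessarily the paper's:

```latex
% Logit quantal-response choice probability (standard form):
% the probability of action a in state s is softmax over payoffs,
% with \lambda controlling how sharply the agent favors high payoffs
% (\lambda \to \infty recovers best response; \lambda = 0 is uniform).
P(a \mid s) \;=\;
  \frac{\exp\!\big(\lambda\, u(a, s)\big)}
       {\sum_{a' \in \mathcal{A}} \exp\!\big(\lambda\, u(a', s)\big)}
```

Under this reading, an LLM's sampled responses are modeled as noisy best replies to the payoffs induced by the game template, which is what makes the "template-over-safety flip" conjecture expressible as a shift in the effective payoff function.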