A Theoretical Game of Attacks via Compositional Skills

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) remain vulnerable to adversarial prompts that circumvent alignment-based defenses. This work formalizes the interaction between an attacker and a defender as a game, characterizes its equilibria, and shows that the attacker holds an inherent advantage. From this analysis, the study derives a best-response attack strategy, which it relates to many existing adversarial prompting methods, and a provably optimal defense strategy. In experiments across several LLMs and benchmarks, a practical instantiation of the attack performs better than existing adversarial prompting approaches.

📝 Abstract

As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy and show that it is closely related to many existing adversarial prompting methods. We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy. Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance relative to existing adversarial prompting approaches in diverse settings encompassing different LLMs and benchmarks.
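The abstract does not spell out the game itself, so the following is only a minimal sketch of the general notions it names (best response and equilibrium) in a finite zero-sum setting. It is not the paper's formalization: the payoff tables, the function names, and the reading of each entry as an attack success probability are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical payoff table (made-up numbers, not from the paper):
# PAYOFF[i][j] is the attacker's success probability when the attacker
# plays strategy i and the defender plays strategy j. The defender's
# payoff is the negative of this value (zero-sum assumption).
PAYOFF = [
    [0.9, 0.2, 0.6],
    [0.3, 0.8, 0.5],
    [0.6, 0.6, 0.4],
]

# A second hypothetical table that has a pure-strategy equilibrium.
SADDLE = [
    [0.5, 0.7],
    [0.4, 0.6],
]


def attacker_best_response(payoff, defender_mix):
    """Return (row, value): the row maximizing expected success
    against the defender's mixed strategy."""
    expected = [sum(p * q for p, q in zip(row, defender_mix)) for row in payoff]
    best = max(range(len(expected)), key=expected.__getitem__)
    return best, expected[best]


def defender_best_response(payoff, attacker_mix):
    """Return (column, value): the column minimizing expected attacker
    success against the attacker's mixed strategy."""
    n_cols = len(payoff[0])
    expected = [
        sum(attacker_mix[i] * payoff[i][j] for i in range(len(payoff)))
        for j in range(n_cols)
    ]
    best = min(range(n_cols), key=expected.__getitem__)
    return best, expected[best]


def pure_equilibria(payoff):
    """List (row, column) pairs from which neither player can improve
    by deviating unilaterally (saddle points of the table)."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    found = []
    for i, j in product(range(n_rows), range(n_cols)):
        attacker_content = all(payoff[i][j] >= payoff[k][j] for k in range(n_rows))
        defender_content = all(payoff[i][j] <= payoff[i][l] for l in range(n_cols))
        if attacker_content and defender_content:
            found.append((i, j))
    return found


if __name__ == "__main__":
    print(attacker_best_response(PAYOFF, [1.0, 0.0, 0.0]))
    print(defender_best_response(PAYOFF, [0.0, 0.0, 1.0]))
    print(pure_equilibria(PAYOFF))
    print(pure_equilibria(SADDLE))
```

In the first table no pure-strategy equilibrium exists, so each side would have to randomize. In the second, the pair (0, 0) is stable because neither player gains by deviating alone. Whether the paper's game behaves like either toy case cannot be determined from the abstract.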

Problem

Research questions and friction points this paper is trying to address.

adversarial prompting

large language models

AI safety

alignment

game theory

Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial prompting

game-theoretic framework

optimal defense