🤖 AI Summary
To address the rigidity and poor adaptability of existing jailbreaking strategies against black-box large language models (LLMs), this paper proposes MAJIC, a Markov chain-based adaptive jailbreaking framework. It treats the disguise strategies in a maintained strategy pool as states of a Markov chain and updates the transition probabilities between them based on attack feedback, so that strategy selection and fusion are optimized online. By combining static prompt engineering with this dynamic feedback, the framework supports multi-round iterative attacks. Evaluated on mainstream black-box LLMs, including GPT-4o and Gemini-2.0-flash, it achieves an attack success rate above 90% with fewer than 15 queries per attempt on average, significantly outperforming existing black-box jailbreaking methods. The core contribution is the formalization of sequential strategy selection and fusion as a Markov chain, which is intended to improve the generalizability, robustness, and query efficiency of the attack.
📝 Abstract
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse, innovative disguise strategies. MAJIC first establishes a "Disguise Strategy Pool" by refining existing strategies and introducing several innovative approaches. To further improve attack performance and efficiency, MAJIC formulates the sequential selection and fusion of strategies in the pool as a Markov chain. Under this formulation, MAJIC initializes and employs a Markov matrix to guide strategy composition, where the transition probabilities between strategies are dynamically adapted based on attack outcomes, thereby enabling MAJIC to learn and discover effective attack pathways tailored to the target model. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash, achieving an attack success rate above 90% with fewer than 15 queries per attempt on average.
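The abstract does not give the update rule for the Markov matrix, so the following is only a generic sketch of a feedback-adapted, row-stochastic transition matrix over abstract state indices. The function names (`init_matrix`, `next_state`, `update`), the multiplicative update, and the learning rate `lr` are all assumptions made for illustration and are not taken from the paper.

```python
import random


def init_matrix(n):
    """Return a uniform row-stochastic n x n matrix (assumed initialization)."""
    return [[1.0 / n] * n for _ in range(n)]


def next_state(matrix, current, rng):
    """Sample the next state index from the row of the current state."""
    return rng.choices(range(len(matrix)), weights=matrix[current], k=1)[0]


def update(matrix, src, dst, success, lr=0.2):
    """Scale the src -> dst probability up on success or down on failure,
    then renormalize the row so it still sums to 1.
    This multiplicative rule is a hypothetical stand-in, not the paper's rule."""
    row = list(matrix[src])
    row[dst] *= (1.0 + lr) if success else (1.0 - lr)
    total = sum(row)
    matrix[src] = [p / total for p in row]
```

In such a scheme, transitions that precede a favorable outcome gradually gain probability mass while the rest of the row shrinks proportionally, which is one simple way to read "transition probabilities are dynamically adapted based on attack outcomes."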