🤖 AI Summary
Large language models (LLMs) suffer from insufficient robustness evaluation against jailbreaking attacks, as existing methods predominantly rely on single natural-language strategies and thus inadequately stress-test safety alignment mechanisms.
Method: We propose a multi-strategy jailbreaking framework integrating symbolic equation solving and code completion. It encodes malicious intent as cross-domain mathematical programming tasks, inducing attention misalignment through symbolic modeling, prompt-driven code generation, and task obfuscation, which operate synergistically at both the natural-language and execution layers. The method adopts a zero-shot, single-query paradigm without fine-tuning or iterative optimization.
Contribution/Results: Our approach achieves an average jailbreaking success rate of 91.19% on GPT-series models and 98.65% on three state-of-the-art open- and closed-source LLMs, substantially outperforming single-strategy baselines. It is the first to enable coordinated, dual-path attacks leveraging both cross-domain reasoning and code execution, establishing a novel paradigm for evaluating LLM trustworthiness.
📝 Abstract
Large language models (LLMs), such as ChatGPT, have achieved remarkable success across a wide range of fields. However, their trustworthiness remains a significant concern, as they are still susceptible to jailbreak attacks aimed at eliciting inappropriate or harmful responses. Moreover, existing jailbreak attacks mainly operate at the natural-language level and rely on a single attack strategy, limiting their effectiveness in comprehensively assessing LLM robustness. In this paper, we propose EquaCode, a novel multi-strategy jailbreak approach for large language models via equation solving and code completion. This approach transforms malicious intent into a mathematical problem and then requires the LLM to solve it using code, leveraging the complexity of cross-domain tasks to divert the model's focus toward task completion rather than safety constraints. Experimental results show that EquaCode achieves an average success rate of 91.19% on the GPT series and 98.65% across three state-of-the-art LLMs, all with only a single query. Furthermore, ablation experiments demonstrate that EquaCode outperforms either the mathematical equation module or the code module alone, suggesting a strong synergistic effect: the multi-strategy approach yields results greater than the sum of its parts.