🤖 AI Summary
This work addresses a critical security blind spot in the use of large language models (LLMs) for the automated design of intelligent optimization algorithms, where such models are vulnerable to jailbreak attacks that elicit harmful content. The study presents the first systematic investigation of this risk, introducing MOBjailbreak, the first jailbreak method tailored specifically to optimization algorithm requests, and constructing MalOptBench, a benchmark of 60 malicious prompts. Evaluations across 13 mainstream LLMs reveal an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on the original harmful prompts alone, with near-complete failure under MOBjailbreak. Furthermore, existing plug-and-play defense mechanisms prove largely ineffective and prone to exaggerated safety behaviors. This research uncovers a novel security threat in automated algorithm design and offers crucial insights for the secure deployment of LLMs.
📝 Abstract
The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked vulnerability, focusing on intelligent optimization algorithm design given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored to this scenario. Through extensive evaluation of 13 mainstream LLMs, including the latest GPT-5 and DeepSeek-V3.1, we show that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on the original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses applicable to closed-source models and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.
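The two headline metrics have standard definitions. As a minimal illustrative sketch (not taken from the paper, whose exact judging protocol may differ): attack success rate is the percentage of prompts for which a judge deems the model's response a successful jailbreak, and the harmfulness score averages a per-response rating on a 1-5 scale.

```python
def evaluate(judgments):
    """Compute (attack success rate %, average harmfulness 1-5).

    judgments: list of (is_jailbroken: bool, harm_score: int in 1..5),
    one entry per benchmark prompt. Hypothetical helper for illustration;
    the paper's actual evaluation pipeline is not specified here.
    """
    n = len(judgments)
    asr = 100.0 * sum(1 for jailbroken, _ in judgments if jailbroken) / n
    avg_harm = sum(score for _, score in judgments) / n
    return asr, avg_harm

# Toy example on 4 mock judgments (not real benchmark data):
asr, harm = evaluate([(True, 5), (True, 4), (False, 1), (True, 5)])
# asr == 75.0, harm == 3.75
```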