🤖 AI Summary
This study addresses the security vulnerabilities of large language models (LLMs) in multilingual settings, particularly their susceptibility to jailbreak attacks when processing Classical Chinese, which can circumvent existing safety mechanisms. To study this issue, the authors propose CC-BOS, a framework that leverages the conciseness and obscurity of Classical Chinese to construct adversarial prompts. CC-BOS integrates an eight-dimensional strategy encoding scheme with a multi-dimensional fruit fly optimization algorithm (combining smell search, visual search, and Cauchy mutation) to efficiently generate attack samples under black-box conditions. It also introduces a Classical Chinese to English translation module to improve readability and evaluation accuracy. Experimental results show that CC-BOS consistently outperforms state-of-the-art jailbreak attack methods.
📝 Abstract
As Large Language Models (LLMs) are used more widely, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes CC-BOS, a framework for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context) and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module. Extensive experiments demonstrate the effectiveness of the proposed CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.