AI Summary
Current safety-aligned large language models (LLMs) remain vulnerable to adversarial attacks that bypass alignment safeguards and elicit harmful outputs. To address this, we propose BitBypass, a novel black-box jailbreaking attack that shifts the jailbreaking paradigm from prompt engineering to the bitstream representation layer. BitBypass achieves stealthy, physical-layer obfuscation via hyphen-encoding perturbations on input text, enabling gradient-free construction of adversarial inputs. Crucially, it requires no access to model internals or output feedback, operating solely through bit-level manipulations of the input encoding. Extensive evaluations across GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral demonstrate that BitBypass significantly outperforms state-of-the-art methods in both jailbreaking success rate and stealthiness. Our results expose a critical security vulnerability at the data representation level of aligned LLMs, underscoring the need for robust alignment mechanisms resilient to low-level encoding perturbations.
Abstract
The inherent risk of Large Language Models (LLMs) generating harmful and unsafe content has highlighted the need for their safety alignment. Various techniques, such as supervised fine-tuning, reinforcement learning from human feedback, and red-teaming, have been developed to ensure the safety alignment of LLMs. However, the robustness of these aligned LLMs is continually challenged by adversarial attacks that exploit unexplored, underlying vulnerabilities in the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking: it exploits the fundamental representation of data as continuous bits rather than relying on prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, from an adversarial perspective revealed the capability of BitBypass to bypass their safety alignment and trick them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlight the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.