ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF

career value

222K/year
🤖 AI Summary
This work identifies and names ChatBug—a systematic, previously unrecognized security vulnerability in large language model (LLM) alignment—arising from asymmetric dialogue template formatting: while models are rigidly constrained to follow templates, user inputs remain unconstrained, enabling adversarial prompts to bypass safety alignment mechanisms. Method: We conduct structural analysis of dialogue templates, adversarial prompt engineering, and empirical evaluation across eight state-of-the-art aligned LLMs; complemented by ablation studies to assess mitigation strategies. Contribution/Results: ChatBug is successfully triggered across all evaluated models, substantially increasing jailbreak success rates. Ablation experiments reveal that adversarial training mitigates the vulnerability but incurs significant degradation in model utility. Our findings expose a fundamental security tension between template rigidity in instruction tuning and user input flexibility, unify the underlying risk across prevalent jailbreaking methods, and quantitatively demonstrate an inherent, non-negligible trade-off between safety and performance.

Technology Category

Application Category

📝 Abstract
Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We develop two attacks to exploit the ChatBug vulnerability. We demonstrate that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research
Problem

Research questions and friction points this paper is trying to address.

Language Models
ChatBug
Security vs Usability
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChatBug
Adversarial Training
Security-Performance Tradeoff
🔎 Similar Papers
No similar papers found.