Adversarial Tokenization

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a novel adversarial attack vector—“adversarial tokenization”—which exploits the non-uniqueness of subword tokenization to bypass LLM safety filters and alignment mechanisms without altering the input string. Specifically, it leverages alternative, syntactically valid tokenization paths that preserve surface-form equivalence but induce divergent internal representations. Methodologically, the approach systematically enumerates the subword tokenization space under semantic consistency constraints and employs black-box prompt perturbation for optimization. Empirical evaluation across state-of-the-art models—including Llama-3—and benchmarks such as AdvBench demonstrates jailbreaking success rates competitive with SOTA textual adversarial methods. Crucially, this is the first work to empirically expose a critical security blind spot at the tokenization layer, challenging the prevailing text-centric alignment paradigm. It establishes tokenization-level robustness as a fundamental dimension of LLM security, opening a new research direction for adversarial robustness and alignment verification.

📝 Abstract
Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
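The abstract's example — "penguin" tokenizing as [p,enguin] or [peng,uin] — can be reproduced with a short enumeration of all ways a string splits into vocabulary entries. Below is a minimal sketch assuming a hypothetical toy vocabulary (real subword vocabularies hold tens of thousands of entries, and a production BPE tokenizer deterministically picks one canonical segmentation via its merge rules, which is exactly the single path the paper says pipelines rely on):

```python
def segmentations(s, vocab):
    """Enumerate every way to split string `s` into tokens drawn from `vocab`."""
    if not s:
        return [[]]
    out = []
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in vocab:
            # Recurse on the remainder; each sub-segmentation extends this prefix.
            for rest in segmentations(s[i:], vocab):
                out.append([piece] + rest)
    return out

# Toy vocabulary (an illustrative assumption, not the Llama3 vocabulary).
VOCAB = {"p", "pen", "peng", "penguin", "enguin", "guin", "uin"}

for seg in segmentations("penguin", VOCAB):
    print(seg)
# Four valid segmentations under this toy vocab, including
# ['p', 'enguin'] and ['peng', 'uin'] from the abstract.
```

The exponential blow-up the abstract mentions follows directly: each additional valid split point roughly multiplies the number of segmentations, so long strings admit vastly more tokenizations than the single one seen in training.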
Problem

Research questions and friction points this paper is trying to address.

Explores adversarial tokenization as an attack surface in LLMs
Tests whether retokenized malicious strings evade safety restrictions
Identifies a vulnerability inherent to subword tokenization models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial tokenization opens a previously neglected axis of attack.
Alternative tokenizations evade safety restrictions without changing the harmful text.
Empirical validation across three state-of-the-art LLMs and adversarial datasets reveals the vulnerability.