The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

📅 2025-12-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work reveals and implements a stealthy backdoor attack targeting the tokenizer-transplantation phase of large language model composition. Exploiting the geometry of embedding-space coefficient reuse, the attack constructs “destructive tokens” that appear benign in the source model yet reconstruct into highly salient malicious features in the target model, without modifying the source model’s architecture or training procedure. Using a sparsity-driven solver and a spectral-imitation technique, the method solves a dual-objective optimization that preserves source-model performance while effectively compromising the target model’s generation behavior. The attack is robust against downstream operations such as fine-tuning and weight merging, and evades current anomaly-detection mechanisms.

📝 Abstract
The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge
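The vulnerability described above rests on a simple geometric fact: coefficient-reuse transplant expresses a new token's embedding as a linear combination of anchor tokens shared by both vocabularies, then reuses those coefficients over the base model's anchor embeddings. Because the two embedding spaces are not aligned, coefficients can be chosen so the donor-side reconstruction is near zero (inert) while the base-side reconstruction points at a chosen high-salience direction. The toy sketch below illustrates this geometry with random data; it is not the paper's solver, and all names (`E_src`, `E_tgt`, `m`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_anchor = 8, 8, 32  # toy dimensions; n_anchor > d_src matters

# Hypothetical embeddings of anchor tokens shared by both vocabularies.
E_src = rng.normal(size=(n_anchor, d_src))   # donor model anchors
E_tgt = rng.normal(size=(n_anchor, d_tgt))   # base model anchors

# Crafting a "breaker" coefficient vector: pick coefficients in the left
# null space of the donor anchors, so the donor-side reconstruction is
# exactly zero, while maximizing alignment with a target-side direction m.
U, S, Vt = np.linalg.svd(E_src, full_matrices=True)
N = U[:, d_src:]                  # orthonormal basis of {c : c @ E_src = 0}
m = rng.normal(size=d_tgt)        # stand-in for a high-salience feature
w = N.T @ (E_tgt @ m)
c = N @ (w / np.linalg.norm(w))   # the transplant coefficients of the token

print(np.linalg.norm(c @ E_src))  # ~0: the token looks inert in the donor
print((c @ E_tgt) @ m)            # large: a salient feature in the base model
```

Because `n_anchor > d_src`, the donor map has a large null space, so many coefficient vectors reconstruct to nothing in the donor model yet to something substantial in the base model; the paper's sparse solver additionally keeps the coefficients sparse and statistically unremarkable.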
Problem

Research questions and friction points this paper is trying to address.

tokenizer transplant
supply-chain vulnerability
model composition
LLM security
malicious token
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenizer transplant
supply-chain attack
model composition
stealthy backdoor
embedding geometry