Inverting Trojans in LLMs

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reverse-engineering backdoor triggers in large language models (LLMs) is challenging due to the discreteness of the input space, the combinatorial explosion of candidate triggers (~30,000ᵏ for k-token phrases), and prior methods' heavy reliance on explicit keyword blacklists. Method: a blacklist-free, sample-efficient reverse-engineering approach that constructs a blacklist implicitly, by measuring average cosine similarity in hidden-layer activation space to suppress false positives, and combines greedy, gain-guided discrete search with a high-confidence misclassification criterion to accurately reconstruct trigger phrases from only a small set of clean samples. Contribution/Results: extensive experiments across diverse backdoor attack settings demonstrate substantial improvements in recall and precision, with false positive rates reduced by up to 42%. The method is robust to varying attack configurations and has minimal dependency on clean data, requiring only a few unlabeled clean examples.

📝 Abstract
While effective backdoor detection and inversion schemes have been developed for AI models in other domains, e.g., image classification, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes the gradient-based search over the input space that is central to many backdoor inversion methods. Second, there are ~30,000^k candidate k-tuples to consider, where k is the token length of a putative trigger. Third, for LLMs one must blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals; however, good blacklists may not exist for some domains. We propose an LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits a high misclassification rate with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases.
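The implicit-blacklisting idea in component (ii) can be sketched in a few lines: a candidate token whose hidden-layer activation is, on average, highly cosine-similar to activations of clean target-class samples is likely a word naturally associated with that class rather than a true trigger. The function name and the choice of a single pooled activation vector per sample are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def implicit_blacklist_score(trigger_act: np.ndarray,
                             clean_acts: np.ndarray) -> float:
    """Average cosine similarity between a candidate trigger's hidden-layer
    activation (shape [d]) and activations of clean target-class samples
    (shape [n, d]). A high score flags a token that is marginally associated
    with the target class, so it should not be trusted as a trigger."""
    t = trigger_act / np.linalg.norm(trigger_act)
    c = clean_acts / np.linalg.norm(clean_acts, axis=1, keepdims=True)
    return float(np.mean(c @ t))
```

Candidates scoring near the clean-sample average behave like ordinary target-class vocabulary; this signal stands in for an explicit, hand-curated blacklist.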
Problem

Research questions and friction points this paper is trying to address.

Inverting backdoor triggers in large language models
Overcoming discrete input space and combinatorial search challenges
Detecting triggers without relying on predefined blacklists
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete search with greedy trigger accretion
Implicit blacklisting via activation similarity
Detection through high-confidence misclassifications
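The greedy accretion in the first bullet can be sketched as follows, with `score` standing in for the paper's gain criterion (e.g., misclassification rate on a small clean set, penalized by the implicit-blacklist similarity); all names here are illustrative assumptions rather than the authors' implementation:

```python
def greedy_accrete(vocab, score, max_len=4):
    """Greedily grow a putative trigger one token at a time, starting from
    the best-scoring singleton, until no extension improves the score or
    max_len is reached. `score` maps a token list to a real-valued gain."""
    phrase = [max(vocab, key=lambda tok: score([tok]))]
    while len(phrase) < max_len:
        tok = max(vocab, key=lambda t: score(phrase + [t]))
        if score(phrase + [tok]) <= score(phrase):
            break  # no token yields further gain; stop accreting
        phrase.append(tok)
    return phrase
```

Detection then requires that the accreted phrase both induces many misclassifications and does so with unusually high decision confidence, per the third bullet.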
Zhengxing Li
School of EECS, Penn State
Guangmingmei Yang
School of EECS, Penn State
Jayaram Raghuram
Anomalee Inc.
David J. Miller
School of EECS, Penn State
George Kesidis
Professor of Computer Science and Engineering, Pennsylvania State University; Professor of Electrical Engineering, Pennsylvania State University
networking, security, machine learning, optimization, stochastic processes