Trojan Detection Through Pattern Recognition for Large Language Models

📅 2025-01-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of detecting stealthy Trojan backdoors in large language models (LLMs). It proposes a multi-stage black-box detection framework: (1) filtering suspicious tokens, (2) inverting candidate triggers from output logits with two search variants, one based on beam search and one on greedy decoding, and (3) verifying candidate triggers with semantic-preserving prompts and targeted input perturbations. The key contributions are the black-box inversion method and a semantic-consistency verification strategy that distinguishes genuine backdoors from false-positive adversarial strings. Experiments on the TrojAI and RLHF-poisoned model benchmarks demonstrate that the method improves trigger detection rate by 27.3%, reduces false positive rate by 41.6%, and increases verification accuracy by over 35%.

📝 Abstract

Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model's alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial strings that display similar characteristics. The evaluation of our approach on the TrojAI and RLHF poisoned model datasets demonstrates promising results.
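
The difference between the two inversion variants named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_score`, the planted trigger, and the tiny vocabulary are hypothetical stand-ins for a real black-box query that would return a logit-based score (for example, the log-probability of a suspected target output) for a candidate trigger.

```python
from typing import Callable, List, Sequence

# A scorer maps a candidate trigger (list of tokens) to a number;
# higher means the model's output moved closer to the suspected target.
Score = Callable[[Sequence[str]], float]

# Hypothetical vocabulary left over after a token-filtration stage.
VOCAB = ["the", "alpha", "cat", "omega"]


def toy_score(trigger: Sequence[str]) -> float:
    """Hypothetical stand-in for a black-box model query.

    Counts positions that match a planted two-token trigger. A real
    detector would derive this value from the model's output logits.
    """
    planted = ("alpha", "omega")  # assumption, for illustration only
    return float(sum(1 for got, want in zip(trigger, planted) if got == want))


def greedy_inversion(vocab: Sequence[str], score: Score, max_len: int) -> List[str]:
    """Extend the trigger one token at a time, keeping only the best token."""
    trigger: List[str] = []
    for _ in range(max_len):
        best = max(vocab, key=lambda tok: score(trigger + [tok]))
        trigger.append(best)
    return trigger


def beam_inversion(
    vocab: Sequence[str], score: Score, max_len: int, beam_width: int
) -> List[str]:
    """Keep the `beam_width` best partial triggers at every step."""
    beams: List[List[str]] = [[]]
    for _ in range(max_len):
        candidates = [beam + [tok] for beam in beams for tok in vocab]
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


if __name__ == "__main__":
    print(greedy_inversion(VOCAB, toy_score, max_len=2))
    print(beam_inversion(VOCAB, toy_score, max_len=2, beam_width=2))
```

Greedy decoding issues one batch of queries per position and commits to a single token, so it is cheap but can lock in an early mistake. Beam search spends `beam_width` times as many queries to keep alternatives alive. Either way, a recovered string is only a candidate, which is why the abstract treats verification as a separate, critical stage.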

Problem

Research questions and friction points this paper is trying to address.

Trojan detection

Large language models

Misbehavior prevention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pattern Recognition

Trojan Detection

Large Language Models
