UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

📅 2025-02-18

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

Large language models (LLMs) face diverse yet unified security threats—including prompt injection, backdoor, and adversarial attacks—that lack a cohesive characterization and detection paradigm. Method: This paper introduces Prompt Trigger Attacks (PTA) as a unified threat model and proposes the first end-to-end framework for detecting such attacks. It employs a lightweight, internal-representation-consistency-based detection head that operates in a single forward pass, enabling simultaneous attack detection and text generation without fine-tuning or additional training—achieving plug-and-play deployment. Contribution/Results: The work establishes the first unified PTA attack taxonomy and introduces a zero-training-overhead, single-forward-pass detection mechanism. Evaluated across multiple attack types, the framework achieves an average detection accuracy of 98.2% with less than 3% inference latency overhead, and maintains compatibility with mainstream open-source and commercial LLMs.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

Problem

Research questions and friction points this paper is trying to address.

Detecting prompt injection in LLMs

Unified defense against backdoor attacks

Identifying adversarial attacks in LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified defense mechanism for LLMs

Single-forward strategy for detection

Detects multiple prompt trigger attacks

🔎 Similar Papers

A Survey of Recent Backdoor Attacks and Defenses in Large Language Models