Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models

📅 2024-06-10
🏛️ arXiv.org
📈 Citations: 12
Influential: 2
📄 PDF
🤖 AI Summary
API-based large language models (LLMs) are vulnerable to backdoor attacks due to their reliance on untrusted third-party services. Method: This paper proposes the first lightweight black-box defense method grounded in chain-of-thought (CoT) consistency verification. It requires no access to model weights or training data; instead, it leverages natural-language prompts to elicit self-generated CoTs from the target LLM and verifies logical consistency between the CoT and the final output—enabling zero-shot, single-API-call real-time detection. Contribution/Results: The approach innovatively repurposes the LLM’s intrinsic reasoning capability as a built-in defense mechanism, circumventing traditional dependencies on internal model information or large-scale labeled data. Extensive evaluation across diverse tasks and models—including GPT-4, Claude, and Llama series—demonstrates superior detection accuracy over baselines, with performance improving as model capability increases.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific"triggers"set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes for consistency with the final output -- any inconsistencies indicating a potential attack. It is well-suited for the popular API-only LLM deployments, enabling detection at minimal cost and with little data. User-friendly and driven by natural language, it allows non-experts to perform the defense independently while maintaining transparency. We validate the effectiveness of CoS through extensive experiments on various tasks and LLMs, with results showing greater benefits for more powerful LLMs.
Problem

Research questions and friction points this paper is trying to address.

Detects backdoor attacks in API-accessible LLMs
Uses LLM reasoning to identify output inconsistencies
Provides user-friendly defense for non-technical users
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs' reasoning for backdoor detection
Guides LLM to scrutinize input-output consistency
User-friendly, API-compatible, minimal data needed
🔎 Similar Papers
No similar papers found.
X
Xi Li
The University of Alabama at Birmingham
Yusen Zhang
Yusen Zhang
PhD Student at Penn State University
Natural Language ProcessingMachine Learning
Renze Lou
Renze Lou
Pennsylvania State University
NLPLarge Language ModelsAI4ScienceZero-shot LearningGenerative AI
C
Chen Wu
The Pennsylvania State University
J
Jiaqi Wang
The Pennsylvania State University