🤖 AI Summary
API-based large language models (LLMs) are vulnerable to backdoor attacks due to their reliance on untrusted third-party services. Method: This paper proposes the first lightweight black-box defense method grounded in chain-of-thought (CoT) consistency verification. It requires no access to model weights or training data; instead, it leverages natural-language prompts to elicit self-generated CoTs from the target LLM and verifies logical consistency between the CoT and the final output—enabling zero-shot, single-API-call real-time detection. Contribution/Results: The approach innovatively repurposes the LLM’s intrinsic reasoning capability as a built-in defense mechanism, circumventing traditional dependencies on internal model information or large-scale labeled data. Extensive evaluation across diverse tasks and models—including GPT-4, Claude, and Llama series—demonstrates superior detection accuracy over baselines, with performance improving as model capability increases.
📝 Abstract
Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific"triggers"set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes for consistency with the final output -- any inconsistencies indicating a potential attack. It is well-suited for the popular API-only LLM deployments, enabling detection at minimal cost and with little data. User-friendly and driven by natural language, it allows non-experts to perform the defense independently while maintaining transparency. We validate the effectiveness of CoS through extensive experiments on various tasks and LLMs, with results showing greater benefits for more powerful LLMs.