SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses black-box backdoor attacks on customized large language model (LLM) APIs, where stealthy system prompts are injected with malicious instructions. We propose SLIP, a defense framework that integrates key-phrase-guided chain-of-thought (KCoT) reasoning with a soft-label mechanism (SLM). SLIP extracts critical phrases, quantifies semantic relevance, and suppresses trigger sensitivity and anomalous semantics via outlier-score removal and averaging—all without white-box access, relying solely on API inputs and outputs. Evaluated on classification and question-answering tasks, SLIP reduces attack success rate from 90.2% to 25.13% while preserving high accuracy on clean data. It outperforms existing defenses in both robustness and utility, offering a practical, API-level mitigation for hidden prompt injection backdoors.

📝 Abstract
With the development of customized large language model (LLM) agents, a new threat of black-box backdoor attacks has emerged, where malicious instructions are injected into hidden system prompts. These attacks easily bypass existing defenses that rely on white-box access, posing a serious security challenge. To address this, we propose SLIP, a Soft Label mechanism and key-extraction-guided CoT-based defense against Instruction backdoors in APIs. SLIP is designed based on two key insights. First, to counteract the model's oversensitivity to triggers, we propose a Key-extraction-guided Chain-of-Thought (KCoT). Instead of considering only the single trigger or the input sentence, KCoT prompts the agent to extract task-relevant key phrases. Second, to guide the LLM toward correct answers, our proposed Soft Label Mechanism (SLM) prompts the agent to quantify the semantic correlation between key phrases and candidate answers. Crucially, to mitigate the influence of residual triggers or misleading content in phrases extracted by KCoT, which typically causes anomalous scores, SLM excludes anomalous scores deviating significantly from the mean and subsequently averages the remaining scores to derive a more reliable semantic representation. Extensive experiments on classification and question-answering (QA) tasks demonstrate that SLIP is highly effective, reducing the average attack success rate (ASR) from 90.2% to 25.13% while maintaining high accuracy on clean data and outperforming state-of-the-art defenses. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/SLIP.
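The SLM step described above (exclude scores that deviate significantly from the mean, then average the rest) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the deviation threshold `k` and the mean ± k·std outlier rule are assumptions for demonstration.

```python
from statistics import mean, stdev

def aggregate_soft_labels(scores, k=1.0):
    """Average semantic-relevance scores after dropping anomalous ones.

    A score is treated as anomalous (e.g. caused by a residual trigger
    in an extracted phrase) if it deviates from the mean by more than
    k standard deviations. The rule and k are illustrative assumptions.
    """
    if len(scores) < 2:
        return mean(scores)
    mu, sigma = mean(scores), stdev(scores)
    kept = [s for s in scores if abs(s - mu) <= k * sigma]
    # Fall back to all scores if the filter would discard everything.
    return mean(kept or scores)

# A single trigger-inflated outlier (0.1) is excluded before averaging:
print(aggregate_soft_labels([0.8, 0.82, 0.78, 0.1]))  # 0.8
```

Averaging only the inlier scores keeps one poisoned key phrase from dominating the final semantic representation, which matches the intuition stated in the abstract.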
Problem

Research questions and friction points this paper is trying to address.

Defending against black-box backdoor attacks on customized LLM APIs, where hidden system prompts evade white-box defenses
Reducing the model's oversensitivity to malicious instruction triggers
Mitigating the influence of residual triggers and misleading content in extracted key phrases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Key-extraction-guided Chain-of-Thought (KCoT)
Soft Label Mechanism (SLM)
Excludes anomalous scores for reliability
Zhengxian Wu
Tsinghua University
Computer Vision, Large Language Model

Juan Wen
College of Information and Electrical Engineering, China Agricultural University

Wanli Peng
College of Information and Electrical Engineering, China Agricultural University

Haowei Chang
College of Information and Electrical Engineering, China Agricultural University

Yinghan Zhou
China Agricultural University

Yiming Xue
China Agricultural University
Data hiding, signal processing