SABER: Model-agnostic Backdoor Attack on Chain-of-Thought in Neural Code Generation

📅 2024-12-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work presents the first systematic investigation into the security vulnerabilities of chain-of-thought (CoT) code generation models under backdoor attacks. To address the lack of dedicated attack methodologies for CoT models, the authors propose SABER, a model-agnostic, self-attention-driven backdoor attack framework. SABER selects a malicious output via code mutation operations, uses CodeBERT's self-attention scores to identify the tokens most relevant to the poisoned content, and synthesizes natural, adaptive triggers by modeling user behavior. On HumanEval-CoT, SABER achieves an attack success rate (ASR) of 80.95%, an improvement of 33.33% over RIPPLe. Its stealth is confirmed by both automated and human evaluation: it bypasses ONION-based detection on 61.90% of poisoned samples, while human evaluators correctly identify only 3.17% of triggered samples, demonstrating exceptional stealth and practical threat potential.

📝 Abstract
Recent studies have proposed integrating Chain-of-Thought (CoT) reasoning to further enhance the reliability of Code Language Models (CLMs) in generating code, a step-by-step approach that breaks down complex programming tasks into manageable sub-problems. Advances in this area have introduced CoT models, specifically designed to integrate CoT reasoning effectively into language models, achieving notable improvements in code generation. Despite these advancements, the security of CoT models has not been systematically studied. In this study, we aim to fill this gap by investigating the vulnerability of CoT models to backdoor injection in code generation tasks. To address this, we propose a model-agnostic backdoor attack method SABER (Self-Attention-BasEd backdooR) based on the self-attention mechanism. SABER begins by selecting a malicious output as the backdoor using code mutation operations. It then identifies the tokens most relevant to poisoned content by analyzing self-attention scores in the CodeBERT model. Finally, it mimics user behavior to generate adaptive and natural triggers. Our experiments on HumanEval-CoT and OpenEval-CoT test sets demonstrate that CoT models are susceptible to backdoor attacks via data poisoning. Taking the HumanEval-CoT dataset as an example, SABER achieves an ASR of 80.95%, representing an improvement of 33.33% over RIPPLe and a substantial 4.76% enhancement compared to BadPre. Further evaluations using ONION for automated detection and human studies reveal that SABER is stealthier and harder to detect, bypassing 61.90% of automated detection, with a human detection rate of just 3.17%. Our findings reveal that backdoors can be injected into CoT models to manipulate downstream code generation tasks. This highlights the urgent need for further research to understand and mitigate the security vulnerabilities in CoT models.
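The core idea of SABER's second step, ranking tokens by self-attention weight to decide where a trigger would be most influential, can be illustrated with a minimal, library-free sketch. The real method uses attention scores extracted from CodeBERT; here the scores and tokens are hypothetical stand-ins, and the softmax ranking is only meant to show the selection logic, not reproduce the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rank_tokens_by_attention(tokens, raw_scores, top_k=2):
    """Return the top_k tokens with the highest (toy) attention weight.

    In SABER, high-attention tokens are treated as the positions most
    relevant to the poisoned content; raw_scores here are hypothetical
    values standing in for CodeBERT self-attention scores.
    """
    weights = softmax(raw_scores)
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    return [tokens[i] for i in ranked[:top_k]]

# Toy prompt tokens with made-up attention scores toward a malicious output.
tokens = ["def", "sort_list", "(", "items", ")", ":"]
scores = [0.2, 2.1, 0.1, 1.5, 0.1, 0.1]
print(rank_tokens_by_attention(tokens, scores))  # highest-attention tokens first
```

In the actual attack, the identified positions guide where the naturalistic trigger (produced by code mutation and user-behavior modeling) is placed in the poisoned training samples.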
Problem

Research questions and friction points this paper is trying to address.

Investigates vulnerability of CoT models to backdoor attacks.
Proposes SABER, a model-agnostic backdoor attack method.
Demonstrates CoT models' susceptibility to data poisoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-agnostic backdoor attack method SABER
Self-attention mechanism for token analysis
Adaptive triggers mimicking user behavior
Naizhu Jin
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, Jiangsu, China.
Zhong Li
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, Jiangsu, China.
Yinggang Guo
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, Jiangsu, China.
Chao Su
Beijing Institute of Technology
Natural Language Processing, Machine Translation
Tian Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, Jiangsu, China.
Qingkai Zeng
Assistant Professor, Nankai University; University of Notre Dame
data mining, natural language processing, knowledge graph, large language models