🤖 AI Summary
Neural Code Models (NCMs) are vulnerable to backdoor attacks in critical tasks such as vulnerability detection, posing serious threats to code security. To address this, we propose EliBadCode, a backdoor mitigation framework that, for the first time, incorporates programming-language naming conventions into trigger token filtering. It integrates sample-specific trigger position identification, Greedy Coordinate Gradient-based trigger inversion with anchor guidance, and model-level unlearning, achieving precise removal of stealthy backdoors without requiring access to the original training data or prior knowledge of the attack. The method achieves high purification rates while minimizing degradation on the primary task. Extensive experiments across multiple NCMs and three security-sensitive code understanding tasks show a backdoor removal rate above 95% with a main-task accuracy drop below 1.2%, significantly outperforming existing detoxification approaches.
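The naming-convention filter mentioned above can be sketched roughly as follows. This is a toy illustration under our own assumptions (the tiny vocabulary, the identifier regex, and the helper name are hypothetical), not EliBadCode's actual implementation:

```python
import re

# Toy vocabulary; a real NCM vocabulary (e.g. BPE subwords) contains tens of
# thousands of tokens, many of which can never appear inside a legal
# identifier of the target programming language.
vocab = ["useful_var", "camelCase", "<pad>", "##!!", "if", "x1", "@@", "3abc"]

# Assumed convention: keep only tokens that could form a valid
# Python/Java-style identifier (letters, digits, underscore; no leading digit).
IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def filter_trigger_candidates(tokens):
    """Shrink the trigger search space to convention-conforming tokens."""
    return [t for t in tokens if IDENTIFIER_RE.match(t)]

candidates = filter_trigger_candidates(vocab)
print(candidates)  # only tokens that respect the naming convention survive
```

Restricting the search to convention-conforming tokens is what keeps the subsequent trigger inversion tractable: the optimizer never wastes evaluations on tokens that could not plausibly be injected as an identifier-style trigger.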
📝 Abstract
Neural code models (NCMs) have been widely used for code understanding tasks such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on clean code snippets but exhibit adversary-expected behavior on poisoned code snippets injected with an adversary-crafted trigger, posing a significant security threat. There is therefore an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs. To address this issue, in this paper we propose a backdoor elimination technique for secure code understanding, called EliBadCode, which eliminates backdoors in NCMs by inverting (reverse-engineering) and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for candidate trigger tokens based on the naming conventions of the target programming language, reducing the trigger search space and cost. It then introduces a sample-specific trigger position identification method that reduces the interference of non-backdoor adversarial perturbations during subsequent trigger inversion (a backdoor trigger can itself be viewed as an adversarial perturbation), thereby producing effective inverted backdoor triggers efficiently. Next, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and a trigger anchoring method to purify it. Finally, EliBadCode eliminates the backdoor through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode effectively eliminates backdoors while having minimal adverse effects on the normal functionality of the model.
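The Greedy Coordinate Gradient optimization in the abstract can be illustrated with a minimal sketch. Everything here is assumed for illustration: a linear toy scorer stands in for the backdoored model, and the loss is a stand-in for the real trigger-inversion objective, so this shows only the GCG pattern (gradient-ranked candidate tokens per position, then a greedy swap), not EliBadCode's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, trig_len = 50, 8, 3           # toy vocab size, embedding dim, trigger length
E = rng.normal(size=(V, d))         # assumed token embedding matrix
w = rng.normal(size=d)              # direction the toy "backdoored" scorer prefers

def loss(trigger_ids):
    # Toy surrogate for the inversion objective: push the mean trigger
    # embedding toward w (the adversary-expected behavior).
    return -E[trigger_ids].mean(axis=0) @ w

def gcg_step(trigger_ids, top_k=8):
    """One Greedy Coordinate Gradient step (sketch).

    In this linear toy, the gradient of the loss w.r.t. each position's
    one-hot token choice is the same vector -(E @ w)/n, so the top-k
    candidate replacements are the tokens with the most negative gradient
    entries; we then greedily keep the single swap that truly lowers the loss.
    """
    grad_per_token = -(E @ w) / len(trigger_ids)      # d(loss)/d(one-hot), per token
    best_ids, best_loss = trigger_ids, loss(trigger_ids)
    candidates = np.argsort(grad_per_token)[:top_k]   # most promising tokens first
    for pos in range(len(trigger_ids)):
        for tok in candidates:
            trial = trigger_ids.copy()
            trial[pos] = tok
            trial_loss = loss(trial)
            if trial_loss < best_loss:
                best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss

trigger = rng.integers(0, V, size=trig_len)           # random initial trigger
for _ in range(5):
    trigger, cur = gcg_step(trigger)
print(trigger, cur)
```

In the real setting, the candidate ranking comes from backpropagating the model's loss to the one-hot token inputs, and the loss is evaluated with forward passes through the backdoored NCM; the trigger anchoring and unlearning stages then consume the inverted trigger this loop produces.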