Exploring the Security Threats of Knowledge Base Poisoning in Retrieval-Augmented Code Generation

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically uncovers, for the first time, the security threat of knowledge base poisoning to Retrieval-Augmented Code Generation (RACG) systems: maliciously injected vulnerable code snippets can be erroneously retrieved and incorporated into generated outputs, thereby propagating vulnerabilities across projects. The authors propose a dual-scenario poisoning experimental paradigm and conduct empirical evaluations on a knowledge base constructed from real-world open-source repositories, integrating four mainstream large language models and two representative retrievers. Results show that single-sample poisoning can contaminate up to 48% of generated code. Furthermore, poisoning significantly degrades code security, as measured via static analysis and vulnerability detection tools, and the paper introduces targeted mitigation strategies grounded in retrieval robustness and generation filtering. This study provides both theoretical foundations and practical guidelines for enhancing the robustness and trustworthy deployment of RACG systems.
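The poisoning mechanism the summary describes can be illustrated with a minimal, hypothetical RACG sketch. Everything below (the knowledge-base contents, the keyword-overlap retriever, and the prompt template) is an illustrative assumption, not the paper's actual implementation:

```python
# Toy sketch of a retrieval-augmented code generation (RACG) pipeline,
# showing how one poisoned knowledge-base entry can reach the LLM prompt.

# Knowledge base: code examples collected from public repositories.
# One entry is "poisoned" -- it demonstrates an insecure pattern
# (string-formatted SQL, vulnerable to injection).
KNOWLEDGE_BASE = [
    "def add(a, b):\n    return a + b",
    "def query_user(db, name):\n    return db.execute(f\"SELECT * FROM users WHERE name = '{name}'\")",  # poisoned
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
]

def retrieve(query: str, kb: list[str], k: int = 1) -> list[str]:
    """Toy sparse retriever: rank snippets by keyword overlap with the query."""
    q_tokens = set(query.lower().split())
    scored = sorted(kb, key=lambda s: -len(q_tokens & set(s.lower().split())))
    return scored[:k]

def build_prompt(task: str, kb: list[str]) -> str:
    """Assemble the augmented prompt; retrieved context is trusted blindly."""
    context = "\n\n".join(retrieve(task, kb))
    return f"# Reference examples:\n{context}\n\n# Task: {task}\n"

prompt = build_prompt("write a function to query users from the db by name", KNOWLEDGE_BASE)
# For this task the poisoned SQL example ranks highest and enters the prompt,
# nudging the downstream LLM toward reproducing the injection-prone pattern.
```

Because the retriever scores only topical relevance, not security, a single well-targeted poisoned example can dominate retrieval for matching tasks, which is consistent with the single-sample contamination rate the paper reports.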

📝 Abstract
The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, remain largely unexplored. This risk is particularly concerning given that public code repositories, which often serve as the sources for knowledge base collection in RACG systems, are usually accessible to anyone in the community. Malicious attackers can exploit this accessibility to inject vulnerable code into the knowledge base, making it toxic. Once these poisoned samples are retrieved and incorporated into the generated code, they can propagate security vulnerabilities into the final product. This paper presents the first comprehensive study on the security risks associated with RACG systems, focusing on how vulnerable code in the knowledge base compromises the security of generated code. We investigate the security of LLM-generated code across different settings through extensive experiments using four major LLMs, two retrievers, and two poisoning scenarios. Our findings highlight the significant threat of knowledge base poisoning, where even a single poisoned code example can compromise up to 48% of generated code. Our findings provide crucial insights into vulnerability introduction in RACG systems and offer practical mitigation recommendations, thereby helping improve the security of LLM-generated code in future works.
Problem

Research questions and friction points this paper is trying to address.

Assess security risks in Retrieval-Augmented Code Generation systems.
Investigate impact of poisoned code in knowledge bases.
Propose mitigation strategies for LLM-generated code vulnerabilities.
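One plausible shape for the mitigation question above is retrieval-side filtering: screening retrieved snippets before they are allowed into the prompt. The sketch below is a hypothetical illustration, not the paper's proposed defense; the regex pattern list is a stand-in for what would realistically be a static analyzer or vulnerability detector:

```python
import re

# Hypothetical retrieval-side mitigation: drop retrieved snippets that
# match known insecure patterns before prompt assembly. The patterns
# here are illustrative only, not a complete or robust detector.
INSECURE_PATTERNS = [
    re.compile(r"execute\(\s*f[\"']"),             # string-formatted SQL
    re.compile(r"\beval\("),                       # dynamic code execution
    re.compile(r"subprocess\.\w+\(.*shell=True"),  # shell-injection risk
]

def filter_snippets(snippets: list[str]) -> list[str]:
    """Keep only retrieved snippets that match no insecure pattern."""
    return [s for s in snippets
            if not any(p.search(s) for p in INSECURE_PATTERNS)]

retrieved = [
    "def query_user(db, name):\n    return db.execute(f\"SELECT ...\")",  # poisoned
    "def add(a, b):\n    return a + b",                                   # benign
]
safe = filter_snippets(retrieved)
# Only the benign snippet survives filtering.
```

Pattern-based screening is cheap but easy to evade, which is why generation-side checks (scanning the model's output itself) are a natural complement.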
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM integration
knowledge base poisoning
vulnerability mitigation