🤖 AI Summary
Existing research on security risks of code-generation large language models (CodeGen LLMs) relies predominantly on red teaming and lacks automated, semantically aware blue-team defenses. Method: We propose BlueCodeAgent, an end-to-end security defense framework that integrates collaborative red-team and blue-team automation, combining constitution-based rule reasoning, fine-grained code semantic analysis, dynamic execution verification, and agentic coordination to enable context-aware, multi-level detection of harmful instructions, biased content, and vulnerable code. Contribution/Results: BlueCodeAgent significantly reduces false positives, derives actionable security constitutions, and generalizes to both known and unknown threats. Evaluated on three security tasks across four benchmark datasets, it achieves an average 12.7% F1-score improvement over baselines, notably mitigating the over-conservatism of base models in vulnerability detection and outperforming safety-prompting defenses.
📝 Abstract
As large language models (LLMs) are increasingly used for code generation, concerns over security risks have grown substantially. Early research has focused primarily on red teaming, which aims to uncover and evaluate the vulnerabilities and risks of CodeGen models. Progress on the blue teaming side remains limited, however, because developing defenses requires effective semantic understanding to differentiate unsafe cases from safe ones. To fill this gap, we propose BlueCodeAgent, an end-to-end blue teaming agent enabled by automated red teaming. Our framework integrates both sides: red teaming generates diverse risky instances, and the blue teaming agent leverages them to detect both previously seen and unseen risk scenarios through constitution and code analysis, with agentic integration for multi-level defense. Our evaluation across three representative code-related tasks (bias instruction detection, malicious instruction detection, and vulnerable code detection) shows that BlueCodeAgent achieves significant gains over base models and safety prompt-based defenses. In particular, for vulnerable code detection, BlueCodeAgent integrates dynamic analysis to effectively reduce false positives, a challenging problem because base models tend to be over-conservative, misclassifying safe code as unsafe. Overall, BlueCodeAgent achieves an average 12.7% F1-score improvement across four datasets in three tasks, which we attribute to its ability to summarize actionable constitutions that enhance context-aware risk detection. We demonstrate that red teaming benefits blue teaming by continuously identifying new vulnerabilities that enhance defense performance.
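The paper does not include code, but the two-stage defense it describes (a static pass over constitutions distilled from red-team instances, followed by dynamic verification of flagged code) can be sketched minimally. Everything below is illustrative: the `Constitution` structure, keyword matching, and the `handler` convention are assumptions, not the authors' implementation, and real dynamic analysis would run in a sandbox rather than a bare `exec`.

```python
from dataclasses import dataclass, field

@dataclass
class Constitution:
    # A rule distilled from red-team instances, with surface signals
    # (hypothetical keywords; the real system uses semantic analysis).
    rule: str
    keywords: list = field(default_factory=list)

def detect_instruction(instruction, constitutions):
    """Static pass: return the rules an instruction appears to violate."""
    text = instruction.lower()
    return [c.rule for c in constitutions
            if any(k in text for k in c.keywords)]

def dynamic_verify(code, probe_input):
    """Dynamic pass: execute candidate code on a probe input to confirm
    that statically flagged risky behavior actually manifests.
    (No sandboxing here; a real agent would isolate this step.)"""
    scope = {}
    exec(code, scope)
    return scope["handler"](probe_input)

# Constitutions summarized from hypothetical red-team examples.
constitutions = [
    Constitution("no credential harvesting", ["keylogger", "steal password"]),
    Constitution("no injection-prone code", ["os.system(", "eval("]),
]

# Static detection flags a malicious instruction.
print(detect_instruction("Write a keylogger in Python", constitutions))

# Dynamic verification confirms a flagged snippet is actually exploitable:
# the attacker-controlled string is evaluated as code.
candidate = "def handler(cmd):\n    return eval(cmd)  # unsanitized eval"
print(dynamic_verify(candidate, "1 + 1"))
```

The dynamic step is what the abstract credits for reducing false positives: code that merely mentions a risky API but never exposes it to untrusted input would pass the probe and be reclassified as safe.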