Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code security benchmarks focus narrowly on single tasks (e.g., code completion or generation) and fail to holistically assess large language models' (LLMs) capabilities across security-critical dimensions, including secure code generation, vulnerability repair, and detection. Method: We introduce CoV-Eval, the first multi-task benchmark for code security, covering four tasks: vulnerability repair, detection, classification, and secure code completion. We further propose VC-Judge, an expert-aligned automated evaluator that combines rule-based reasoning with LLM-based judgment, achieving high human agreement (Cohen's κ = 0.89) and efficiency. Contribution/Results: A systematic evaluation of 20 state-of-the-art models on CoV-Eval reveals that current LLMs excel at vulnerability identification but lag significantly in secure code generation and precise repair, especially for type-specific vulnerabilities (e.g., TOCTOU races and logic flaws). Our work provides empirically grounded insights and a reproducible, task-diverse evaluation infrastructure for advancing code security research.
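The summary reports judge–human agreement as Cohen's κ = 0.89. For readers unfamiliar with the statistic, here is a minimal sketch of how κ is computed; the labels below are invented toy data, not the paper's annotations:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently at their marginal rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: automated judge vs. human expert verdicts on 10 programs.
judge = ["vuln", "safe", "vuln", "vuln", "safe", "safe", "vuln", "safe", "vuln", "safe"]
human = ["vuln", "safe", "vuln", "safe", "safe", "safe", "vuln", "safe", "vuln", "safe"]
print(round(cohens_kappa(judge, human), 2))
```

A κ near 1 means agreement well beyond chance, which is why 0.89 supports using the judge in place of manual expert review.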

📝 Abstract
Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus on a single evaluation task and paradigm, such as code completion or generation, and lack comprehensive assessment across dimensions like secure code generation, vulnerability repair, and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering code completion, vulnerability repair, and vulnerability detection and classification, for comprehensive evaluation of LLM code security. In addition, we develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.
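One vulnerability class the evaluation highlights is the TOCTOU (time-of-check to time-of-use) race. A minimal illustration of the pattern in Python, not taken from the benchmark itself: the insecure variant checks permissions and then opens the file as two separate steps, leaving a window in which the filesystem can change; the safer variant simply attempts the operation and lets the OS enforce access atomically at open time:

```python
import os

def read_if_allowed_insecure(path):
    # TOCTOU race: the file's permissions or target can change
    # between this check...
    if os.access(path, os.R_OK):
        # ...and this use.
        with open(path) as f:
            return f.read()
    return None

def read_if_allowed_safer(path):
    # EAFP style: attempt the open and handle failure, so the
    # permission check and the use happen in one atomic syscall.
    try:
        with open(path) as f:
            return f.read()
    except (PermissionError, FileNotFoundError):
        return None
```

This check-then-use shape is exactly the kind of logic-level flaw that the paper finds LLMs struggle to recognize and repair, since neither line is insecure in isolation.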
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' code security across multiple tasks
Assessing vulnerability repair and detection capabilities
Identifying challenges in generating secure code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed CoV-Eval benchmark for multi-task security evaluation
Developed VC-Judge model for reliable vulnerability review
Evaluated 20 LLMs on code security comprehensively
Yutao Mou
Peking University
AI Safety, LLM Alignment
Xiao Deng
Peking University
Vulnerability Detection
Yuxiao Luo
National Engineering Research Center for Software Engineering, Peking University, China
Shikun Zhang
Peking University
Wei Ye
National Engineering Research Center for Software Engineering, Peking University, China