🤖 AI Summary
Existing code security evaluation benchmarks focus narrowly on a single task (e.g., code completion or generation), failing to holistically assess large language models' (LLMs) capabilities across security-critical dimensions, including secure code generation, vulnerability repair, and detection.
Method: We introduce CoV-Eval, the first multi-task benchmark for code security, encompassing four tasks: vulnerability repair, detection, classification, and secure code completion. We further propose VC-Judge, an expert-aligned automated evaluator that combines rule-based reasoning with LLM-based judgment, achieving high human agreement (Cohen's κ = 0.89) with low evaluation cost.
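Cohen's κ, the agreement metric reported above, corrects raw judge-vs-expert agreement for the agreement expected by chance. As a minimal sketch (the verdict labels and counts below are hypothetical, not from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: assume the raters label independently,
    # each according to their own marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical secure/vulnerable verdicts from a judge model and a human expert.
judge  = ["vuln", "secure", "vuln", "secure", "secure", "vuln"]
expert = ["vuln", "secure", "vuln", "secure", "vuln",   "vuln"]
print(round(cohens_kappa(judge, expert), 2))  # → 0.67
```

A κ near 0.89 therefore indicates agreement well beyond chance, not merely a high raw match rate.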
Contribution/Results: A systematic evaluation of 20 state-of-the-art models on CoV-Eval reveals that current LLMs excel at vulnerability identification but lag significantly in secure code generation and precise repair—especially for type-specific vulnerabilities (e.g., TOCTOU races and logical flaws). Our work provides empirically grounded insights and a reproducible, task-diverse evaluation infrastructure for advancing code security research.
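TOCTOU (time-of-check-to-time-of-use) races, one of the vulnerability types cited above as hard for LLMs to handle, arise when a resource is validated and then used in two separate steps. A minimal illustrative sketch (not taken from the benchmark) in Python:

```python
import os

# VULNERABLE: the file can be swapped (e.g., replaced by a symlink to a
# sensitive file) between the access() check (time of check) and the
# open() call (time of use).
def read_if_allowed_insecure(path):
    if os.access(path, os.R_OK):   # time of check
        with open(path) as f:      # time of use — race window above
            return f.read()
    return None

# SAFER: attempt the open directly and handle failure; the permission
# decision and the use refer to the same file descriptor, so there is
# no check-then-use gap to exploit.
def read_if_allowed_safer(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```

Recognizing this pattern requires reasoning about interleaved executions rather than matching a local syntactic signature, which may explain why detection-capable models still miss it.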
📝 Abstract
Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion or generation, and lack comprehensive assessment across dimensions like secure code generation, vulnerability repair, and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering code completion, vulnerability repair, vulnerability detection, and classification, for comprehensive evaluation of LLM code security. In addition, we develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle to recognize specific vulnerability types and perform repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.