🤖 AI Summary
Existing code security evaluation benchmarks focus narrowly on a single task (e.g., code completion or generation), failing to holistically assess large language models' (LLMs) capabilities across security-critical dimensions, including secure code generation, vulnerability repair, and detection.
Method: We introduce CoV-Eval, the first multi-task benchmark for code security, encompassing four tasks: vulnerability repair, detection, classification, and secure code completion. We further propose VC-Judge, an expert-aligned automated evaluator that combines rule-based reasoning with LLM-based judgment, achieving high human agreement (Cohen's κ = 0.89) with low evaluation cost.
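Cohen's κ, the agreement metric reported above, corrects raw judge-vs-expert agreement for the agreement expected by chance. As a minimal sketch (the verdict labels and counts below are hypothetical, not from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: assume the raters label independently,
    # each according to their own marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical secure/vulnerable verdicts from a judge model and a human expert.
judge  = ["vuln", "secure", "vuln", "secure", "secure", "vuln"]
expert = ["vuln", "secure", "vuln", "secure", "vuln",   "vuln"]
print(round(cohens_kappa(judge, expert), 2))  # → 0.67
```

A κ near 0.89 therefore indicates agreement well beyond chance, not merely a high raw match rate.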
Contribution/Results: A systematic evaluation of 20 state-of-the-art models on CoV-Eval reveals that current LLMs excel at vulnerability identification but lag significantly in secure code generation and precise repair—especially for type-specific vulnerabilities (e.g., TOCTOU races and logical flaws). Our work provides empirically grounded insights and a reproducible, task-diverse evaluation infrastructure for advancing code security research.
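TOCTOU (time-of-check-to-time-of-use) races, one of the vulnerability types cited above as hard for LLMs to handle, arise when a resource is validated and then used in two separate steps. A minimal illustrative sketch (not taken from the benchmark) in Python:

```python
import os

# VULNERABLE: the file can be swapped (e.g., replaced by a symlink to a
# sensitive file) between the access() check (time of check) and the
# open() call (time of use).
def read_if_allowed_insecure(path):
    if os.access(path, os.R_OK):   # time of check
        with open(path) as f:      # time of use — race window above
            return f.read()
    return None

# SAFER: attempt the open directly and handle failure; the permission
# decision and the use refer to the same file descriptor, so there is
# no check-then-use gap to exploit.
def read_if_allowed_safer(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```

Recognizing this pattern requires reasoning about interleaved executions rather than matching a local syntactic signature, which may explain why detection-capable models still miss it.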
📝 Abstract
Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion or generation, and lack comprehensive assessment across dimensions like secure code generation, vulnerability repair, and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering code completion, vulnerability repair, vulnerability detection, and classification, for comprehensive evaluation of LLM code security. In addition, we develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle to recognize specific vulnerability types and perform repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.