SecCodeBench-V2 Technical Report

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of publicly available, industry-oriented evaluation benchmarks for assessing the ability of large language models (LLMs) to generate secure code. To this end, we introduce SecCodeBench-V2, a function-level benchmark for secure code generation and repair that spans five programming languages and 22 Common Weakness Enumeration (CWE) vulnerability categories, comprising 98 real-world vulnerability scenarios along with corresponding proof-of-concept (PoC) test cases. SecCodeBench-V2 is the first benchmark to integrate dynamic execution validation, expert review, and an LLM-as-a-judge mechanism. By modeling tasks at the function level and employing Pass@K scoring aggregation, it enables fine-grained, cross-language, and reproducible evaluation of AI programming assistants in terms of both security and functional correctness.

📝 Abstract
We introduce SecCodeBench-V2, a publicly released benchmark for evaluating the ability of Large Language Model (LLM) coding copilots to generate secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial production code, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and Node.js. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec-code-bench. The benchmark is publicly available at https://github.com/alibaba/sec-code-bench.
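The abstract does not spell out the Pass@K formula or the severity-based aggregation. As a hedged illustration only, the sketch below uses the standard unbiased Pass@K estimator common in code-generation evaluation, paired with a hypothetical severity-weighted average over scenarios; the `aggregate` function and its weighting scheme are assumptions, not the benchmark's actual protocol:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@K estimator: the probability that at least
    one of k samples drawn without replacement from n generations
    (c of which pass) is correct."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def aggregate(scores: list[tuple[float, float]]) -> float:
    """Hypothetical severity-weighted mean over scenarios.
    scores: list of (pass_at_k_score, severity_weight) pairs."""
    total_weight = sum(w for _, w in scores)
    return sum(s * w for s, w in scores) / total_weight
```

For example, a scenario with 10 generations of which 3 pass both functional and security checks yields Pass@1 = 0.3; scenario scores would then be combined with whatever severity weights the protocol assigns.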
Problem

Research questions and friction points this paper is trying to address.

secure code generation
LLM evaluation
software security
CWE
code benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

secure code generation
LLM evaluation benchmark
dynamic execution
LLM-as-a-judge
Pass@K scoring
Longfei Chen
ShanghaiTech University
Data Visualization
Ji Zhao
PhD, Huazhong University of Science and Technology
Computer Vision, Machine Learning, Robotics
Lanxiao Cui
Alibaba Group, Hangzhou, China
Tong Su
Alibaba Group, Hangzhou, China
Xingbo Pan
Tsinghua University
quantum network coding, quantum homomorphic encryption, quantum communication
Ziyang Li
Johns Hopkins University
Programming Languages, Machine Learning
Yongxing Wu
Alibaba Group, Hangzhou, China
Qijiang Cao
Alibaba Group, Hangzhou, China
Qiyao Cai
Alibaba Group, Hangzhou, China
Jing Zhang
Alibaba Group, Hangzhou, China
Yuandong Ni
Alibaba Group, Hangzhou, China
Junyao He
Alibaba Group, Hangzhou, China
Zeyu Zhang
Alibaba Group, Hangzhou, China
Chao Ge
Alibaba Group, Hangzhou, China
Xuhuai Lu
Alibaba Group, Hangzhou, China
Zeyu Gao
Tsinghua University
Yuxin Cui
Tsinghua University
Weisen Chen
Alibaba Group, Hangzhou, China
Yuxuan Peng
Alibaba Group, Hangzhou, China
Shengping Wang
Alibaba Group, Hangzhou, China
Qi Li
Alibaba Group, Hangzhou, China
Yukai Huang
Alibaba Group, Hangzhou, China
Yukun Liu
Alibaba Group, Hangzhou, China
Tuo Zhou
The University of Hong Kong (HKU)
Large Language Models, Multi-Agent Systems
Terry Yue Zhuo
Researcher
Large Language Models, Code Generation, AI4SE, Cybersecurity