A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based code security evaluation benchmarks suffer from three critical limitations: (1) reliance on isolated code snippets, (2) lack of reproducibility, and (3) neglect of the interplay between input context and output security. To address these, we introduce A.S.E—the first AI-generated code security evaluation benchmark grounded in real-world software repositories. A.S.E fully preserves contextual elements such as build systems and cross-file dependencies, enabling repository-level security assessment. Its novel containerized, reproducible framework jointly audits outputs via static analysis, dynamic build validation, and expert-defined multidimensional criteria—namely security correctness, build quality, and generation stability. Evaluation is conducted on CVE-grounded, realistic tasks. Results show Claude-3.7-Sonnet achieves the best overall performance, while Qwen3-235B-A22B-Instruct attains the highest security score. Notably, a “fast-thinking” decoding strategy significantly outperforms complex reasoning methods in security patching tasks.
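The summary above describes A.S.E's evaluation pipeline only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows how a repository-level, containerized security check of this kind could be wired together: apply the model-generated patch to a checkout of the vulnerable repository, validate the build inside an isolated container, and run a static analyzer over the patched tree. The build image name (`ase-build-env`), the `make`-based build command, and the semgrep ruleset path are illustrative assumptions standing in for A.S.E's actual images and expert-defined rules.

```python
"""Minimal sketch of a containerized, repository-level security check.

Not A.S.E's actual code: image names, build commands, rule paths, and the
result schema are assumptions made for illustration only.
"""
import json
import subprocess
from pathlib import Path


def apply_patch(repo_dir: Path, patch_file: Path) -> bool:
    """Apply an LLM-generated patch to the repository checkout."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "apply", str(patch_file)],
        capture_output=True,
    )
    return result.returncode == 0


def build_in_container(repo_dir: Path, image: str = "ase-build-env:latest") -> bool:
    """Dynamic build validation: compile the patched repository in a container.

    `image` is a hypothetical build environment; the build command assumes a
    make-based project purely for the sake of the example.
    """
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{repo_dir}:/workspace", "-w", "/workspace",
         image, "bash", "-lc", "make -j$(nproc)"],
        capture_output=True,
    )
    return result.returncode == 0


def static_security_findings(repo_dir: Path, rules: str = "rules/cwe-rules.yaml") -> int:
    """Static analysis pass using an expert-curated ruleset (here: semgrep)."""
    result = subprocess.run(
        ["semgrep", "--quiet", "--json", "--config", rules, str(repo_dir)],
        capture_output=True, text=True,
    )
    findings = json.loads(result.stdout or "{}").get("results", [])
    return len(findings)


def evaluate(repo_dir: Path, patch_file: Path) -> dict:
    """Combine patch application, build success, and findings into one auditable record."""
    patched = apply_patch(repo_dir, patch_file)
    built = patched and build_in_container(repo_dir)
    findings = static_security_findings(repo_dir) if built else None
    return {"patch_applied": patched, "build_ok": built, "security_findings": findings}
```

Running `evaluate(Path("checkout/"), Path("candidate.patch"))` would yield a small dictionary per task, which is the kind of stable, rule-based record the paper's reproducibility claims are about; the real benchmark additionally aggregates such records into security, build-quality, and generation-stability scores.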

📝 Abstract
The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
Problem

Research questions and friction points this paper is trying to address.

Evaluating security in AI-generated code at repository level
Addressing inadequate benchmarks for secure code generation
Connecting input context quality with output security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repository-level benchmark with real-world CVEs
Containerized evaluation framework for reproducibility
Expert-defined rules for stable security assessments
Authors

Keke Lian
Tencent
Bing Wang
Peking University
Lei Zhang
Fudan University
Libo Chen
Shanghai Jiao Tong University
Junjie Wang
Tsinghua University
Ziming Zhao
Zhejiang University
Encrypted traffic analysis, Adversarial examples, Quantum computing
Yujiu Yang
SIGS, Tsinghua University
Machine Learning, Natural language processing, Computer vision
Haotong Duan
Tencent
Haoran Zhao
Fudan University
Shuang Liao
Fudan University
Mingda Guo
Fudan University
Jiazheng Quan
Peking University
Yilu Zhong
Peking University
Chenhao He
Shanghai Jiao Tong University
Zichuan Chen
Shanghai Jiao Tong University
Jie Wu
Tsinghua University
Haoling Li
Tsinghua University, MSRA
Zhaoxuan Li
Institute of Information Engineering, Chinese Academy of Sciences
Jiongchi Yu
Singapore Management University
Software Engineering, Security
Hui Li
Peking University
Dong Zhang
Tencent