RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the critical gap in evaluating large language models (LLMs) for generating code that is both functionally correct and secure, particularly in high-risk contexts. We introduce the first benchmark specifically designed for secure code generation in Java, derived from real-world software repositories and encompassing 19 Common Weakness Enumeration (CWE) categories across 105 complex instances with intricate data-flow dependencies. The benchmark's quality is ensured through a multi-stage validation pipeline combining CodeQL static analysis, LLM-assisted false-positive filtering, and expert review. We propose SecurePass@K, a novel metric that jointly assesses functional correctness and security. Empirical evaluation reveals that current mainstream LLMs exhibit limited capability in secure code generation. While retrieval-augmented generation (RAG) improves functionality, it offers negligible security benefits; generic safety prompts not only fail to reliably prevent vulnerabilities but often introduce compilation errors.

πŸ“ Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic vulnerabilities or evaluating functional correctness in isolation, failing to capture the complex interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories. Our methodology employs a multi-stage pipeline that combines systematic SAST scanning with CodeQL, LLM-based false positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances grounded in real-world repository contexts, spanning 19 Common Weakness Enumeration (CWE) types and exhibiting a wide diversity of data flow complexities, including vulnerabilities with up to 34-hop inter-procedural dependencies. Using RealSec-bench, we conduct an extensive empirical study on 5 popular LLMs. We introduce a novel composite metric, SecurePass@K, to assess both functional correctness and security simultaneously. We find that while Retrieval-Augmented Generation (RAG) techniques can improve functional correctness, they provide negligible benefits to security. Furthermore, explicitly prompting models with general security guidelines often leads to compilation failures, harming functional correctness without reliably preventing vulnerabilities. Our work highlights the gap between functional and secure code generation in current LLMs.
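The abstract does not spell out how SecurePass@K is computed. A natural reading, sketched below as an assumption, is that it mirrors the standard unbiased pass@k estimator (Chen et al., 2021), except that a generated sample only counts as a success if it both passes the functional tests and is flagged as vulnerability-free (e.g., by the CodeQL pipeline). All names and numbers in this sketch are illustrative, not taken from the paper.

```python
from math import comb

def secure_pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k-style unbiased estimator, where c counts samples that are
    BOTH functionally correct AND free of detected vulnerabilities.

    n: total samples generated per task
    c: samples that pass tests and the security check
    k: budget of samples the user would draw

    Assumption: SecurePass@K follows the standard pass@k combinatorial
    form; the paper's exact definition may differ.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a secure-and-correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 10 samples, 3 pass tests, but only 2 of those
# also pass the security scan -> c = 2.
print(secure_pass_at_k(10, 2, 1))  # prints 0.2
```

Scoring per-task with the joint success count (rather than multiplying a functional score by a separate security score) keeps the metric faithful to the requirement that a single sample must satisfy both criteria at once.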
Problem

Research questions and friction points this paper is trying to address.

secure code generation
large language models
software security
real-world vulnerabilities
code correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

RealSec-bench
secure code generation
LLM evaluation
SAST
SecurePass@K