SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM security evaluation benchmarks predominantly rely on synthetic data, failing to capture the complexity and ambiguity inherent in real-world software security tasks. Method: We propose the first fully automated evaluation framework targeting authentic vulnerability scenarios, covering two core tasks: proof-of-concept (PoC) generation and vulnerability repair. Our approach introduces a novel multi-agent collaborative scaffold that automatically constructs reproducible vulnerability repositories, generates gold-standard patches and test cases, and integrates automated sandboxing and vulnerability harnesses. Contribution/Results: The framework enables high-quality, realistic dataset construction at just $0.87 per instance. Evaluated on our comprehensive benchmark, state-of-the-art LLM-based code agents achieve only 18.0% success rate on PoC generation and 34.0% on vulnerability repair—revealing a substantial gap between current capabilities and industrial deployment readiness.
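The two evaluation tasks in the summary can be sketched as a minimal harness loop. This is an illustrative stand-in, not SEC-bench's actual interface: the instance fields, function names, and the toy crash predicate are all hypothetical, and the real framework runs agents against reproducible repositories inside sandboxed containers rather than in-process lambdas.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VulnInstance:
    repo: str                              # reproducible vulnerability repository (illustrative)
    triggers_crash: Callable[[str], bool]  # harness: does this input crash the vulnerable build?
    gold_patch: str                        # gold-standard fix kept for reference evaluation

def eval_poc(instance: VulnInstance, poc_input: str) -> bool:
    """PoC-generation task: the agent's input must reproduce the crash."""
    return instance.triggers_crash(poc_input)

def eval_patch(patched_crash: Callable[[str], bool], known_poc: str) -> bool:
    """Patching task: after the agent's patch, the known PoC must no longer crash."""
    return not patched_crash(known_poc)

# Toy instance: a "vulnerability" triggered by inputs longer than 8 bytes.
inst = VulnInstance(
    repo="libexample",
    triggers_crash=lambda s: len(s) > 8,
    gold_patch="bounds check before copy",
)

print(eval_poc(inst, "AAAAAAAAAAAA"))            # True: crash reproduced
print(eval_patch(lambda s: False, "AAAAAAAAAAAA"))  # True: patch suppresses the crash
```

The per-instance success rates reported in the paper (18.0% PoC, 34.0% patching) would correspond to averaging these booleans over the full dataset.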

📝 Abstract
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents on real-world software security tasks
Addressing gaps in synthetic benchmarks for security engineering
Automating vulnerability dataset creation and agent performance assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmarking framework for security tasks
Multi-agent scaffold for vulnerability reproduction
Low-cost generation of reproducible vulnerability datasets