SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM security evaluation benchmarks predominantly rely on synthetic data, failing to capture the complexity and ambiguity inherent in real-world software security tasks. Method: We propose the first fully automated evaluation framework targeting authentic vulnerability scenarios, covering two core tasks: proof-of-concept (PoC) generation and vulnerability repair. Our approach introduces a novel multi-agent collaborative scaffold that automatically constructs reproducible vulnerability repositories, generates gold-standard patches and test cases, and integrates automated sandboxing and vulnerability harnesses. Contribution/Results: The framework enables high-quality, realistic dataset construction at just $0.87 per instance. Evaluated on our comprehensive benchmark, state-of-the-art LLM-based code agents achieve only 18.0% success rate on PoC generation and 34.0% on vulnerability repair—revealing a substantial gap between current capabilities and industrial deployment readiness.
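The two evaluation tasks in the summary can be sketched as a minimal harness loop. This is an illustrative stand-in, not SEC-bench's actual interface: the instance fields, function names, and the toy crash predicate are all hypothetical, and the real framework runs agents against reproducible repositories inside sandboxed containers rather than in-process lambdas.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VulnInstance:
    repo: str                              # reproducible vulnerability repository (illustrative)
    triggers_crash: Callable[[str], bool]  # harness: does this input crash the vulnerable build?
    gold_patch: str                        # gold-standard fix kept for reference evaluation

def eval_poc(instance: VulnInstance, poc_input: str) -> bool:
    """PoC-generation task: the agent's input must reproduce the crash."""
    return instance.triggers_crash(poc_input)

def eval_patch(patched_crash: Callable[[str], bool], known_poc: str) -> bool:
    """Patching task: after the agent's patch, the known PoC must no longer crash."""
    return not patched_crash(known_poc)

# Toy instance: a "vulnerability" triggered by inputs longer than 8 bytes.
inst = VulnInstance(
    repo="libexample",
    triggers_crash=lambda s: len(s) > 8,
    gold_patch="bounds check before copy",
)

print(eval_poc(inst, "AAAAAAAAAAAA"))            # True: crash reproduced
print(eval_patch(lambda s: False, "AAAAAAAAAAAA"))  # True: patch suppresses the crash
```

The per-instance success rates reported in the paper (18.0% PoC, 34.0% patching) would correspond to averaging these booleans over the full dataset.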

📝 Abstract
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents on real-world software security tasks
Addressing gaps in synthetic benchmarks for security engineering
Automating vulnerability dataset creation and agent performance assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmarking framework for security tasks
Multi-agent scaffold for vulnerability reproduction
Low-cost generation of reproducible vulnerability datasets