🤖 AI Summary
Existing evaluations of code security agents rely on manual vulnerability reproduction, which is costly, unscalable, and lags behind current threat data. This work proposes the first multi-agent collaborative framework that automatically transforms sparse CVE metadata into executable, expert-level vulnerability repair tasks, establishing LiveCVEBench, a continuously updated benchmark, and synthesizing over a thousand training environments. By combining automated CVE parsing, executable environment generation, and fine-tuning of Qwen3-32B, the approach achieves 95% task correctness, 96% environment fidelity, and a 66.2% verified success rate on real-world vulnerability repair. The fine-tuned model improves substantially on LiveCVEBench, rising from 5.3% to 35.8% and surpassing Claude 4.5 Sonnet, advancing the scalable development of code security agents.
📝 Abstract
Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these limitations, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. Evaluated on the latest realistic vulnerabilities, it achieves a 66.2% verified success rate. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats, including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale synthesis of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (the fine-tuned model), the training dataset, and a leaderboard. All resources are available at https://github.com/livecvebench/CVE-Factory.