Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the security risks of "vibe coding", an emerging LLM-agent programming paradigm, when generating code for real-world open-source tasks. Method: The authors introduce SUSVIBES, a production-oriented benchmark comprising 200 realistic, vulnerability-prone programming tasks derived from actual open-source projects, and evaluate leading coding agents (e.g., SWE-Agent with Claude 4 Sonnet). Contribution/Results: While 61% of generated solutions are functionally correct, only 10.5% also satisfy security requirements, revealing a severe security gap in current agents. Lightweight mitigation strategies, such as vulnerability-aware prompting, yield only marginal improvements. This work provides the first empirical evidence of a critical decoupling between functional correctness and security in LLM coding agents, establishing both foundational insights and an evaluation infrastructure for developing trustworthy, secure coding assistants.

📝 Abstract
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe coding, particularly in security-sensitive applications.
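The gap the abstract reports comes from scoring each task twice: once against functional tests and once against security checks. A minimal sketch of how such joint pass rates could be computed from per-task records (the record layout and field names here are hypothetical, not the paper's actual harness):

```python
# Sketch: functional vs. functional-and-secure pass rates over per-task
# evaluation records. Field names ("functional", "secure") are illustrative.

records = [
    {"functional": True,  "secure": False},  # works, but vulnerable
    {"functional": True,  "secure": True},   # works and is secure
    {"functional": False, "secure": False},  # fails functional tests
    {"functional": True,  "secure": False},  # works, but vulnerable
]

def functional_rate(records):
    """Fraction of tasks whose patch passes the functional test suite."""
    return sum(r["functional"] for r in records) / len(records)

def secure_rate(records):
    """Fraction of tasks that are both functionally correct and secure."""
    return sum(r["functional"] and r["secure"] for r in records) / len(records)

print(f"functional={functional_rate(records):.0%} secure={secure_rate(records):.0%}")
```

Counting only functionally correct solutions as candidates for "secure" mirrors the paper's framing: an insecure patch that also fails the tests is simply a failure, while the interesting cases are patches that look done but ship a vulnerability.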
Problem

Research questions and friction points this paper is trying to address.

Benchmarking security vulnerabilities in agent-generated code from real-world tasks
Measuring how reliably coding agents produce secure software solutions
Assessing the risks of deploying vibe coding outputs in security-sensitive applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking real-world coding tasks for vulnerabilities
Evaluating multiple LLM agents on security performance
Testing vulnerability hint augmentation as mitigation strategy
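One mitigation the paper tests is augmenting the feature request with a vulnerability hint before handing it to the agent. A minimal sketch of what such prompt augmentation could look like (the function name and hint wording below are hypothetical, not the paper's actual implementation):

```python
# Sketch: prepend a vulnerability-aware hint to a feature request.
# The helper and hint text are illustrative only.

def add_vulnerability_hint(feature_request: str, cwe_id: str, description: str) -> str:
    """Prefix the request with a warning about a known-risky weakness class."""
    hint = (
        f"Security note: implementations of this feature have historically "
        f"introduced {cwe_id} ({description}). Avoid this weakness."
    )
    return f"{hint}\n\n{feature_request}"

prompt = add_vulnerability_hint(
    "Add an endpoint that renders user-supplied Markdown as HTML.",
    "CWE-79",
    "cross-site scripting via unsanitized output",
)
print(prompt)
```

Per the abstract, even this kind of explicit hinting yielded only marginal security improvements, which is what makes the finding notable: the agents are told about the weakness class and still produce vulnerable code.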
👥 Authors
Songwen Zhao
Carnegie Mellon University, Language Technologies Institute
Danqing Wang
Carnegie Mellon University
Natural Language Processing · Drug Discovery
Kexun Zhang
Carnegie Mellon University
Jiaxuan Luo
Carnegie Mellon University, Language Technologies Institute
Zhuo Li
HydroX AI
Lei Li
Carnegie Mellon University, Language Technologies Institute