Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the security risks of "vibe coding", an emerging LLM-agent programming paradigm, when generating code for real-world open-source tasks. Method: The authors introduce SUSVIBES, a production-oriented benchmark comprising 200 realistic, vulnerability-prone programming tasks derived from actual open-source projects, and evaluate leading coding agents (e.g., SWE-Agent with Claude 4 Sonnet). Contribution/Results: While 61% of generated solutions are functionally correct, only 10.5% also satisfy security requirements, revealing a severe security gap in current agents. Lightweight mitigation strategies, such as vulnerability-aware prompting, yield only marginal improvements. This work provides the first empirical evidence of a critical decoupling between functional correctness and security in LLM coding agents, establishing both foundational insights and an evaluation infrastructure for developing trustworthy, secure coding assistants.

📝 Abstract
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe coding, particularly in security-sensitive applications.
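The gap the abstract reports comes from scoring each task twice: once against functional tests and once against security checks. A minimal sketch of how such joint pass rates could be computed from per-task records (the record layout and field names here are hypothetical, not the paper's actual harness):

```python
# Sketch: functional vs. functional-and-secure pass rates over per-task
# evaluation records. Field names ("functional", "secure") are illustrative.

records = [
    {"functional": True,  "secure": False},  # works, but vulnerable
    {"functional": True,  "secure": True},   # works and is secure
    {"functional": False, "secure": False},  # fails functional tests
    {"functional": True,  "secure": False},  # works, but vulnerable
]

def functional_rate(records):
    """Fraction of tasks whose patch passes the functional test suite."""
    return sum(r["functional"] for r in records) / len(records)

def secure_rate(records):
    """Fraction of tasks that are both functionally correct and secure."""
    return sum(r["functional"] and r["secure"] for r in records) / len(records)

print(f"functional={functional_rate(records):.0%} secure={secure_rate(records):.0%}")
```

Counting only functionally correct solutions as candidates for "secure" mirrors the paper's framing: an insecure patch that also fails the tests is simply a failure, while the interesting cases are patches that look done but ship a vulnerability.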
Problem

Research questions and friction points this paper is trying to address.

Benchmarking security vulnerabilities in agent-generated code from real-world tasks
Measuring how reliably coding agents produce secure software solutions
Assessing the risks of deploying vibe coding outputs in security-sensitive applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking real-world coding tasks for vulnerabilities
Evaluating multiple LLM agents on security performance
Testing vulnerability hint augmentation as mitigation strategy
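One mitigation the paper tests is augmenting the feature request with a vulnerability hint before handing it to the agent. A minimal sketch of what such prompt augmentation could look like (the function name and hint wording below are hypothetical, not the paper's actual implementation):

```python
# Sketch: prepend a vulnerability-aware hint to a feature request.
# The helper and hint text are illustrative only.

def add_vulnerability_hint(feature_request: str, cwe_id: str, description: str) -> str:
    """Prefix the request with a warning about a known-risky weakness class."""
    hint = (
        f"Security note: implementations of this feature have historically "
        f"introduced {cwe_id} ({description}). Avoid this weakness."
    )
    return f"{hint}\n\n{feature_request}"

prompt = add_vulnerability_hint(
    "Add an endpoint that renders user-supplied Markdown as HTML.",
    "CWE-79",
    "cross-site scripting via unsanitized output",
)
print(prompt)
```

Per the abstract, even this kind of explicit hinting yielded only marginal security improvements, which is what makes the finding notable: the agents are told about the weakness class and still produce vulnerable code.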
👥 Authors
Songwen Zhao
Carnegie Mellon University, Language Technologies Institute
Danqing Wang
Carnegie Mellon University
Natural Language Processing · Drug Discovery
Kexun Zhang
Carnegie Mellon University
Jiaxuan Luo
Carnegie Mellon University, Language Technologies Institute
Zhuo Li
HydroX AI
Lei Li
Carnegie Mellon University, Language Technologies Institute