Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

career value

206K/year

🤖 AI Summary

AI code agents pose significant security risks in software engineering workflows due to their potential execution of malicious code. Method: We propose JAWS-BENCH, the first executable-aware security evaluation framework for AI code agents. Unlike conventional text-based refusal-rate metrics, JAWS-BENCH introduces three progressively complex attack scenarios—empty, single-file, and multi-file workspaces—and employs a hierarchical adjudication mechanism assessing compliance, attack success, syntactic correctness, and runtime executability. Contribution/Results: Experiments reveal that agents frequently reverse initial refusals, increasing attack success by 1.6×; in multi-file settings, the average attack success rate reaches 75%, with 32% of malicious payloads directly deployable. This work constitutes the first systematic demonstration of critical executable-level vulnerabilities in state-of-the-art code LLM agents and establishes a novel paradigm for rigorous, execution-grounded security evaluation of AI coding assistants.

Technology Category

Application Category

📝 Abstract

Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt-only conditions in JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end. Moving to single-file regime in JAWS-1 drives compliance to ~ 100% for capable models and yields a mean ASR (Attack Success Rate) ~ 71%; the multi-file regime (JAWS-M) raises mean ASR to ~ 75%, with 32% instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability -- ASR raises by 1.6x -- because initial refusals are frequently overturned during later planning/tool-use steps. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent's multi-step reasoning and tool use.

Problem

Research questions and friction points this paper is trying to address.

Assessing security vulnerabilities in AI code agents through systematic jailbreaking attacks

Evaluating whether code agents actually compile and execute malicious programs

Measuring deployable harm beyond simple refusal detection in multi-step workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark with escalating workspace regimes for security

Hierarchical executable-aware framework tests deployable harm

Analysis shows agents increase vulnerability over standalone models

🔎 Similar Papers

No similar papers found.