MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical security robustness deficiency of code large language models (LLMs) under multi-turn adversarial prompting. We introduce, for the first time, a *conversational code decomposition attack* paradigm—demonstrating a novel threat that evades existing safety detectors. To enable rigorous evaluation, we construct MOCHA, the first large-scale benchmark supporting both single-turn and multi-turn safety assessment. Leveraging MOCHA, we systematically investigate multi-turn prompt engineering, safety-aware fine-tuning, and cross-model transferability across open- and closed-weight code LLMs. Experimental results reveal pervasive vulnerabilities in state-of-the-art code models. Fine-tuning on MOCHA improves rejection rates for malicious queries by up to 32.4%; notably, this enhancement generalizes effectively to unseen external adversarial examples—even without additional human annotation.

Technology Category

Application Category

📝 Abstract
Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce enchmarkname{}, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
Problem

Research questions and friction points this paper is trying to address.

Assessing code LLM robustness against multi-turn malicious prompts
Introducing code decomposition attacks to evade safety filters
Evaluating model vulnerabilities and improving rejection rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces code decomposition attack technique
Develops large-scale benchmark for robustness evaluation
Enhances model robustness via fine-tuning approach