Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a unified benchmark for evaluating how safely code generation models handle malicious requests. Existing refusal benchmarks conflate requests for executable malicious code with requests for harmful security knowledge. To resolve this ambiguity, the authors release an expanded, consensus-labeled prompt bank for assessing compliance with malicious-code requests. A five-judge panel (Fleiss’ κ = 0.767) classified 6,675 prompts from eight corpora, separating the two request types. The resulting dataset comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts (6,671 of the 6,675 classified), with 95.0% of prompts drawing agreement from at least four judges and 76.9% unanimous. The bank provides a reliability-quantified, comparable foundation for measuring whether a coding model refuses unsafe requests.
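As a rough illustration of how per-prompt judge votes could be turned into a consensus label and into the agreement tiers reported here (at least four agreeing, unanimous), the sketch below uses a simple majority rule. The function names, the `min_agree=3` threshold, and the toy votes (including the `OTHER` label) are assumptions made for illustration; the page does not state the exact decision rule the authors used.

```python
from collections import Counter

def consensus(votes, min_agree=3):
    """Return (label, top_count) for one prompt.

    Assumption: a label becomes the consensus only if at least
    `min_agree` judges chose it; otherwise the label is None.
    """
    label, count = Counter(votes).most_common(1)[0]
    return (label if count >= min_agree else None), count

def agreement_summary(all_votes, min_agree=3):
    """Summarise consensus labels and agreement tiers over many prompts."""
    results = [consensus(v, min_agree) for v in all_votes]
    n = len(results)
    n_judges = len(all_votes[0])
    return {
        "labels": Counter(label for label, _ in results),
        "at_least_four": sum(c >= 4 for _, c in results) / n,
        "unanimous": sum(c == n_judges for _, c in results) / n,
    }

# Hypothetical toy votes from five judges on four prompts.
toy_votes = [
    ["CODE"] * 5,
    ["CODE", "CODE", "CODE", "CODE", "KNOWLEDGE"],
    ["KNOWLEDGE", "KNOWLEDGE", "KNOWLEDGE", "CODE", "CODE"],
    ["CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE", "OTHER"],  # no majority
]
summary = agreement_summary(toy_votes)
```

On the toy data, the last prompt has no label with three or more votes, so it ends up without a consensus label. This is one way a bank could contain slightly fewer labeled prompts than were classified.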

📝 Abstract

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.
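For readers unfamiliar with the reliability statistic quoted above, the following is a minimal sketch of Fleiss' kappa for a fixed number of judges per prompt. It follows the standard textbook definition (mean observed pairwise agreement versus chance agreement from the pooled label proportions) and is not taken from the authors' code. The function name and the toy ratings are hypothetical, and the bootstrap confidence interval reported in the abstract is not reproduced here.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`, a list of per-item label lists.

    Assumption: every item is rated by the same number of judges (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    pooled = Counter()          # label counts over all items and judges
    observed_sum = 0.0
    for row in ratings:
        if len(row) != n_raters:
            raise ValueError("every item needs the same number of ratings")
        counts = Counter(row)
        pooled.update(counts)
        # Fraction of judge pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_observed = observed_sum / n_items
    total = n_items * n_raters
    p_chance = sum((c / total) ** 2 for c in pooled.values())
    if p_chance == 1.0:
        return 1.0  # degenerate case: only one label was ever used
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical toy data: four prompts, five judges each.
toy = [
    ["CODE"] * 5,
    ["KNOWLEDGE"] * 5,
    ["CODE", "CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE"],
    ["CODE", "CODE", "CODE", "CODE", "KNOWLEDGE"],
]
kappa = fleiss_kappa(toy)
n_calls = 6675 * 5  # judge calls implied by the abstract: 33,375
```

On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as "substantial" agreement, which is the band the reported 0.767 falls in.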

Problem

Research questions and friction points this paper is trying to address.

malicious code

code generation models

refusal benchmark

compliance measurement

consensus labeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

consensus labeling

malicious code generation

refusal benchmark