AI Summary
This work addresses the lack of a unified benchmark for evaluating the safety of code generation models: existing refusal benchmarks often conflate requests for executable malicious code with requests for harmful security-related knowledge. To resolve this ambiguity, the authors introduce an expanded, consensus-labeled prompt bank for assessing malicious-code compliance. A panel of five judges (Fleiss' κ = 0.767) classified 6,675 prompts drawn from eight corpora, separating the two request types. The resulting dataset comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, with 95.0% of prompts receiving agreement from at least four judges. The bank provides a reliability-quantified, comparable basis for measuring whether a model refuses unsafe requests.
Abstract
A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.
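The agreement statistics quoted above (Fleiss' kappa, the share of prompts with at least four agreeing judges, the unanimous share) can be illustrated with a minimal sketch. It assumes each prompt receives one of two labels, CODE or KNOWLEDGE, from each of five judges. The function names, the toy ratings, and the two-label simplification are hypothetical and are not taken from the paper's released code or data.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of judges."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the fraction of judge pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: from the overall proportion of votes for each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)

def consensus_label(item):
    """Return the majority label for one prompt and the number of judges behind it."""
    label, votes = Counter(item).most_common(1)[0]
    return label, votes

# Hypothetical toy ratings: 4 prompts x 5 judges (not the paper's data).
LABELS = ["CODE", "KNOWLEDGE"]
toy = [
    ["CODE"] * 5,
    ["CODE"] * 4 + ["KNOWLEDGE"],
    ["KNOWLEDGE"] * 5,
    ["CODE"] * 3 + ["KNOWLEDGE"] * 2,
]

kappa = fleiss_kappa(toy, LABELS)
votes = [consensus_label(item)[1] for item in toy]
share_at_least_four = sum(v >= 4 for v in votes) / len(toy)
share_unanimous = sum(v == 5 for v in votes) / len(toy)

print(f"kappa = {kappa:.3f}")                      # kappa = 0.479 on the toy data
print(f"at least 4 agree: {share_at_least_four:.0%}")  # 75%
print(f"unanimous: {share_unanimous:.0%}")             # 50%

# The call count stated in the abstract is plain arithmetic.
assert 6675 * 5 == 33375
```

On the toy data, three of four prompts draw at least four agreeing judges and two are unanimous. The paper reports the analogous figures (95.0% and 76.9%) over the full 6,675-prompt bank.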