A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

📅 2026-05-04

🤖 AI Summary

This study addresses a limitation of existing benchmarks for malicious code generation: they do not distinguish requests for executable weapons from requests for security-related knowledge, so a single refusal rate mixes what may be two distinct refusal mechanisms. The authors propose a weapons-versus-knowledge classification axis and label 3,133 prompts drawn from four public benchmarks by majority vote among five language-model judges from Anthropic, OpenAI, Google, Zhipu AI, and Alibaba. Under a three-of-five majority rule they obtain a primary set of 1,554 consensus-CODE prompts and a comparison set of 388 consensus-KNOWLEDGE prompts. Inter-rater reliability, measured with Fleiss' kappa, is 0.876 (95% CI: 0.862 to 0.888); 69.3% of prompts received unanimous labels, and no prompt was excluded as ambiguous. Whether the distinction separates model behavior in practice is left to a companion benchmark study.
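The three-of-five majority rule described in the summary can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `consensus_label` and `is_unanimous`, the constants, and the example votes are all hypothetical, chosen only to show how a binary CODE/KNOWLEDGE vote from five judges resolves to one label.

```python
from collections import Counter

N_JUDGES = 5    # assumption: exactly five judges vote on every prompt
THRESHOLD = 3   # three-of-five majority rule


def consensus_label(votes):
    """Return the label that reaches THRESHOLD votes, or None if no label does."""
    if len(votes) != N_JUDGES:
        raise ValueError(f"expected {N_JUDGES} votes, got {len(votes)}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= THRESHOLD else None


def is_unanimous(votes):
    """True when all judges issued the same label (five-of-five)."""
    return len(set(votes)) == 1


# Hypothetical votes for a single prompt, one entry per judge.
example_votes = ["CODE", "CODE", "KNOWLEDGE", "CODE", "KNOWLEDGE"]
example_result = consensus_label(example_votes)
```

With two possible labels and an odd number of judges, one label always collects at least three votes, so `consensus_label` never returns `None` when every judge issues a label. That is consistent with the report that no prompt was excluded as ambiguous.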

📝 Abstract

Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned language models, and a single refusal-rate statistic computed over a mixture cannot isolate either. This paper introduces a weapons-versus-knowledge classification axis, operationalized through a five-model consensus protocol, and applies it to 3,133 prompts drawn from four public benchmarks, yielding a 1,554-prompt consensus-CODE bank (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set used by the companion benchmark paper. The consensus pipeline uses five large-language-model judges spanning five vendor families (Anthropic, OpenAI, Google, Zhipu AI, Alibaba), each issuing a binary CODE/KNOWLEDGE label per prompt under a three-of-five majority rule, with inter-rater reliability quantified by Fleiss' kappa with bootstrap 95% confidence intervals. Across all 3,133 prompts the five judges achieve kappa = 0.876 [95% CI: 0.862, 0.888], "almost perfect" agreement by the Landis & Koch convention, with 69.3% of prompts unanimous at five-of-five; all 3,133 prompts reached the 3-of-5 threshold, so the consensus pipeline produced zero ambiguity-excluded prompts. Whether the axis separates model behavior in practice is an empirical question this paper leaves to the companion benchmark study; the present contribution is the reliability-documented artifact and the case for treating the weapons-versus-knowledge distinction as the organizing axis of code-safety evaluation.
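The reliability statistic named in the abstract, Fleiss' kappa with a bootstrap confidence interval, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the toy vote counts are invented, the function names `fleiss_kappa` and `bootstrap_ci` are hypothetical, and the percentile bootstrap over prompts is one common choice that the abstract does not confirm.

```python
import random


def fleiss_kappa(counts):
    """Fleiss' kappa for rows of per-category vote counts.

    Each row is one prompt, e.g. [n_CODE, n_KNOWLEDGE], and every row
    sums to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all votes that fell into each category.
    p = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per prompt, then averaged over prompts.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items
    # Agreement expected by chance.
    p_e = sum(pj * pj for pj in p)
    if 1 - p_e == 0:
        return float("nan")  # undefined when every vote is in one category
    return (p_bar - p_e) / (1 - p_e)


def bootstrap_ci(counts, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling prompts with replacement."""
    rng = random.Random(seed)
    n_items = len(counts)
    stats = []
    for _ in range(n_boot):
        sample = [counts[rng.randrange(n_items)] for _ in range(n_items)]
        kappa = fleiss_kappa(sample)
        if kappa == kappa:  # skip undefined (NaN) resamples
            stats.append(kappa)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy data (invented): four prompts, five judges, columns [CODE, KNOWLEDGE].
toy_counts = [[5, 0], [0, 5], [3, 2], [2, 3]]
toy_kappa = fleiss_kappa(toy_counts)
toy_ci = bootstrap_ci(toy_counts)
```

Under the Landis & Koch convention cited in the abstract, kappa values from 0.81 to 1.00 are read as "almost perfect" agreement, which is the band the reported 0.876 falls in.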

Problem

Research questions and friction points this paper is trying to address.

malicious code generation

language model safety

executable weapons

security knowledge

refusal behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

weapons-versus-knowledge

consensus labeling

malicious code generation