🤖 AI Summary
This study addresses a limitation of existing malicious-code-generation benchmarks: they do not distinguish requests for executable weapons from requests for security-related knowledge, so a single refusal rate mixes what may be two distinct refusal mechanisms in language models. To resolve this, the authors propose a weapon–knowledge dichotomy and build a high-agreement dataset by classifying 3,133 prompts with five language-model judges from Anthropic, OpenAI, Google, Zhipu AI, and Alibaba. Each prompt is labeled by a three-out-of-five majority vote, and inter-rater reliability is measured with Fleiss’ kappa. The result is a primary set of 1,554 consensus-CODE samples and a contrasting set of 388 consensus-KNOWLEDGE samples. The annotation achieves a Fleiss’ kappa of 0.876, with 69.3% of samples labeled unanimously and no prompts excluded as ambiguous, giving a reliability-documented separation of the two request types.
📝 Abstract
Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned language models, and a single refusal-rate statistic computed over a mixture cannot isolate either. This paper introduces a weapons-versus-knowledge classification axis, operationalized through a five-model consensus protocol, and applies it to 3,133 prompts drawn from four public benchmarks, yielding a 1,554-prompt consensus-CODE bank (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set used by the companion benchmark paper. The consensus pipeline uses five large-language-model judges spanning five vendor families (Anthropic, OpenAI, Google, Zhipu AI, Alibaba), each issuing a binary CODE/KNOWLEDGE label per prompt under a three-of-five majority rule, with inter-rater reliability quantified by Fleiss' kappa with bootstrap 95% confidence intervals. Across all 3,133 prompts the five judges achieve kappa = 0.876 [95% CI: 0.862, 0.888], "almost perfect" agreement by the Landis & Koch convention, with 69.3% of prompts unanimous at five-of-five; all 3,133 prompts reached the three-of-five threshold, so the consensus pipeline produced zero ambiguity-excluded prompts. Whether the axis separates model behavior in practice is an empirical question this paper leaves to the companion benchmark study; the present contribution is the reliability-documented artifact and the case for treating the weapons-versus-knowledge distinction as the organizing axis of code-safety evaluation.
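
The consensus protocol described above (binary votes from five judges, a three-of-five majority rule, Fleiss' kappa, and a bootstrap confidence interval) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 0/1 encoding (1 = CODE, 0 = KNOWLEDGE), the percentile bootstrap over prompts, and the toy vote matrix are all assumptions made for the example.

```python
import numpy as np


def majority_label(votes):
    """Return the majority label per prompt.

    votes: array of shape (n_prompts, n_judges) with entries 0 or 1
    (assumed encoding: 1 = CODE, 0 = KNOWLEDGE).
    """
    votes = np.asarray(votes)
    # A strict majority of 1-votes gives label 1, otherwise label 0.
    return (2 * votes.sum(axis=1) > votes.shape[1]).astype(int)


def fleiss_kappa(votes):
    """Fleiss' kappa for binary ratings with a fixed number of raters."""
    votes = np.asarray(votes)
    n_items, n_raters = votes.shape
    ones = votes.sum(axis=1)
    # counts[i, j] = number of raters assigning category j to item i.
    counts = np.stack([n_raters - ones, ones], axis=1)
    # Overall proportion of ratings in each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    # Observed agreement per item, then averaged.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Agreement expected by chance.
    p_e = (p_j ** 2).sum()
    if np.isclose(p_e, 1.0):
        # Every rating falls in one category: kappa is undefined.
        return float("nan")
    return float((p_bar - p_e) / (1.0 - p_e))


def bootstrap_kappa_ci(votes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling prompts with replacement."""
    votes = np.asarray(votes)
    rng = np.random.default_rng(seed)
    n_items = votes.shape[0]
    stats = [
        fleiss_kappa(votes[rng.integers(0, n_items, size=n_items)])
        for _ in range(n_boot)
    ]
    # Ignore degenerate resamples where kappa is undefined.
    lo, hi = np.nanquantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Toy vote matrix (hypothetical data): 4 prompts, 5 judges.
toy_votes = np.array([
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
])
toy_labels = majority_label(toy_votes)
toy_kappa = fleiss_kappa(toy_votes)
toy_ci = bootstrap_kappa_ci(toy_votes, n_boot=200)
```

Note that with five binary votes and no abstentions a strict majority always exists, so under these assumptions the three-of-five rule by itself cannot exclude a prompt; the share of five-of-five unanimous prompts is the more informative agreement summary alongside kappa.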