The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code

📅 2026-05-05



🤖 AI Summary

This study systematically quantifies how well a commercial large language model (Claude Opus 4.6) can generate functionally equivalent yet structurally diverse malicious payloads, a capability that prior work had assumed rather than measured, and assesses what this means for signature- and similarity-based detection. The authors build a dual-agent, four-stage automated pipeline and measure pairwise divergence along two axes: structural, using abstract syntax trees (ASTs), and semantic, using embedding vectors. They compare two prompting strategies: one that specifies only functional requirements, and one that injects a structured history of prior outcomes to force divergence. The results show that the model produces payloads with high structural diversity even without explicit instructions, while semantic distances remain low. Explicit guidance substantially amplifies structural variation while preserving correctness. It raises token consumption roughly fivefold but increases LLM calls only marginally (from 4.2 to 4.5 per payload on average), so large populations of structurally diverse payloads remain cheap to generate.

📝 Abstract

Malware authors have traditionally relied on polymorphic techniques to produce variants in the same malware family, complicating signature-based detection. Integrating generative AI into offensive toolchains enables attackers to synthesize structurally diverse payloads with identical behavior, raising the question of how much polymorphism LLMs provide. Recent work has assumed that LLMs can produce sufficiently polymorphic payloads, leaving unquantified the variation that emerges when an attacker repeatedly builds the same payload, or explicitly instructs the model to avoid prior implementations. In this work, we measure the polymorphic capacity of a commercial model (Claude Opus 4.6) as an automated malware generator. We build a dual-agent, four-stage pipeline that generates, tests, and refines a data-exfiltration payload comprising file traversal, encryption, exfiltration, and integration. We produce payloads in two settings: using prompts that specify only functional requirements, and using prompts that inject a structured history of prior outcomes to force divergence. We measure pairwise distances along structural (AST) and semantic (embedding) axes, finding that when polymorphism is not explicitly required, structural distances are high while semantic distances remain low; i.e., implementations diverge widely without changing high-level behavior. Explicit prompting substantially amplifies this structural diversity while preserving correctness, at the cost of roughly 5 times more tokens but only a small increase in LLM calls (from 4.2 to 4.5 per payload, with effective API costs of $0.41 and $0.73). These results show that a single commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse payloads, facilitating the evasion of signature-based detection rules and similarity-based clustering.
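To make the idea of a pairwise structural distance concrete, here is a minimal sketch of one way such a measurement could look from an analyst's side. The abstract does not say which AST metric the authors use, so the node-type histogram with cosine distance below is an assumption chosen for brevity, and the function names (`node_histogram`, `cosine_distance`, `mean_pairwise_distance`) are hypothetical. The inputs are harmless toy functions that all sum a list. The embedding-based semantic axis is not shown because it would require an embedding model.

```python
import ast
import math
from collections import Counter
from itertools import combinations


def node_histogram(source: str) -> Counter:
    """Count how often each AST node type occurs in a Python source string."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(sources: list[str]) -> float:
    """Average structural distance over all unordered pairs of samples."""
    if len(sources) < 2:
        raise ValueError("need at least two samples to form a pair")
    hists = [node_histogram(s) for s in sources]
    pairs = list(combinations(hists, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Three benign, behaviorally equivalent implementations (assumed toy data).
variant_loop = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
variant_renamed = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
variant_builtin = "def total(xs):\n    return sum(xs)\n"

if __name__ == "__main__":
    samples = [variant_loop, variant_renamed, variant_builtin]
    print(f"mean pairwise AST distance: {mean_pairwise_distance(samples):.3f}")
```

In this sketch, renaming identifiers leaves the node-type histogram unchanged, so `variant_loop` and `variant_renamed` sit at distance zero, while the loop and built-in versions are separated. A histogram ignores tree shape, so a tree-edit distance would be a stricter choice if finer structural resolution were needed.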

Problem

Research questions and friction points this paper is trying to address.

polymorphism

LLM-generated malware

signature-based detection evasion

structural diversity

behavioral equivalence

Innovation

Methods, ideas, or system contributions that make the work stand out.

polymorphism

large language models

malware generation