🤖 AI Summary
This work exposes a critical security vulnerability in large language models (LLMs): the internalization of malicious content due to training data contamination, specifically manifesting as the generation of phishing URLs embedded in ostensibly benign code under harmless developer prompts. To systematically assess this risk, we introduce the first scalable, automated auditing framework that synthesizes semantically innocuous developer prompts using known phishing domains. We conduct large-scale red-teaming evaluations across state-of-the-art production-grade models—including GPT-4o and Llama-4-Scout. Our experiments provide the first systematic empirical evidence that all tested models exhibit robustness failures attributable to data poisoning: on average, 4.2% of generated code snippets contain malicious URLs. Crucially, we identify 177 cross-model-triggering innocuous prompts—inputs that consistently elicit harmful outputs across diverse architectures—establishing a novel paradigm and empirical benchmark for LLM training data security assessment.
📝 Abstract
Large Language Models (LLMs) have become critical to modern software development, but their reliance on internet datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To evaluate this threat, this paper introduces a scalable, automated audit framework that synthesizes innocuous, developer-style prompts from known scam databases to query production LLMs and determine if they generate code containing harmful URLs. We conducted a large-scale evaluation across four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), and found a systemic vulnerability, with all tested models generating malicious code at a non-negligible rate. On average, 4.2% of programs generated in our experiments contained malicious URLs. Crucially, this malicious code is often generated in response to benign prompts. We manually validate the prompts which cause all four LLMs to generate malicious code, and resulting in 177 innocuous prompts that trigger all models to produce harmful outputs. These results provide strong empirical evidence that the training data of production LLMs has been successfully poisoned at scale, underscoring the urgent need for more robust defense mechanisms and post-generation safety checks to mitigate the propagation of hidden security threats.