🤖 AI Summary
This work identifies and formally defines a novel large language model (LLM) security threat, the Advertisement Embedding Attack (AEA): adversaries compromise third-party service-distribution platforms or poison open-source model checkpoints, injecting promotional, misleading, or malicious content into LLM outputs via adversarial prompting and backdoor fine-tuning, all while preserving superficially benign behavior and undermining output integrity. The authors categorize five distinct stakeholder victim groups, exposing critical blind spots in existing defenses, and empirically demonstrate that AEA is feasible at low cost. To counter the threat, they propose a lightweight, prompt-based self-auditing defense that requires no model retraining and mitigates AEA without sacrificing utility. Contributions: (1) the first systematic characterization of AEA as a supply-chain-driven output-integrity threat; (2) empirical validation across diverse models and services; and (3) a practical, deployable defense framework that establishes a baseline for research on LLM output controllability and supply-chain security.
📝 Abstract
We introduce Advertisement Embedding Attacks (AEA), a new class of LLM security threats that stealthily inject promotional or malicious content into model outputs and AI agents. AEA operate through two low-cost vectors: (1) hijacking third-party service-distribution platforms to prepend adversarial prompts, and (2) publishing back-doored open-source checkpoints fine-tuned with attacker data. Unlike conventional attacks that degrade accuracy, AEA subvert information integrity, causing models to return covert ads, propaganda, or hate speech while appearing normal. We detail the attack pipeline, map five stakeholder victim groups, and present an initial prompt-based self-inspection defense that mitigates these injections without additional model retraining. Our findings reveal an urgent, under-addressed gap in LLM security and call for coordinated detection, auditing, and policy responses from the AI-safety community.
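The paper does not publish its defense prompt, but the idea it describes, a prompt-based self-inspection step that needs no retraining, can be sketched as a thin wrapper around any chat API. Everything below is illustrative: the preamble wording, the function names (`build_defended_prompt`, `self_inspect`), and the marker blocklist are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a prompt-based self-inspection defense against AEA.
# The audit preamble and helper names are illustrative, not from the paper.

SELF_AUDIT_PREAMBLE = (
    "Before answering, silently check your draft response for content the "
    "user did not request: advertisements, promotional links, propaganda, "
    "or hate speech. Remove any such content, then give only the clean answer."
)


def build_defended_prompt(user_prompt: str) -> str:
    """Prepend the self-audit instruction so the model screens its own output."""
    return f"{SELF_AUDIT_PREAMBLE}\n\nUser request:\n{user_prompt}"


def self_inspect(model_output: str,
                 blocklist: tuple = ("visit our sponsor", "buy now")) -> str:
    """Cheap client-side backstop: drop lines matching simple injection markers.

    A substring blocklist is a stand-in for whatever detector a deployment
    actually uses; it only illustrates the 'inspect before returning' step.
    """
    kept = [line for line in model_output.splitlines()
            if not any(marker in line.lower() for marker in blocklist)]
    return "\n".join(kept)
```

In use, `build_defended_prompt` wraps each request before it reaches a possibly compromised serving stack, and `self_inspect` screens the reply before it reaches the user, so neither step touches model weights.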