🤖 AI Summary
Existing backdoor attacks, designed largely for the textual channel, become largely ineffective against dual-channel large language models that encode knowledge graph (KG) subgraphs into soft prompts. This work attributes the gap to semantic anchoring, in which graph-derived soft prompts bias generation toward query-consistent semantics and suppress surface-level malicious instructions, and proposes BadSKP, a backdoor attack that turns this effect against the model. BadSKP targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show high attack success under both frozen and trojaned settings, whereas text-only attacks remain unreliable even under perplexity-based defenses.
📝 Abstract
Recent knowledge graph (KG)-enhanced large language models (LLMs) move beyond purely textual knowledge augmentation by encoding retrieved subgraphs into continuous soft prompts via graph neural networks, introducing a graph-conditioned channel that operates alongside the standard text interface. However, existing backdoor attacks are largely designed for the textual channel, and their effectiveness against this dual-channel architecture remains unclear. We show that this architecture creates a robustness gap: text-channel backdoor attacks that readily compromise textual KG prompting systems become largely ineffective against soft-prompt-based counterparts. We interpret this gap through semantic anchoring, whereby graph-derived soft prompts bias the generation-driving hidden state toward query-consistent semantics and suppress surface-level malicious instructions. Because this anchoring effect is itself induced by the graph channel, an attacker who manipulates graph-level representations can in turn redirect it toward adversarial semantics. To demonstrate this risk, we propose BadSKP, a backdoor attack that targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show that BadSKP achieves high attack success under both frozen and trojaned settings, while text-only attacks remain unreliable even under perplexity-based defenses.
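The second stage described above, optimizing a poisoned node embedding so that the induced soft prompt moves toward an adversarial target, can be illustrated with a toy sketch. This is not the authors' implementation: the mean-pooling "encoder", the linear projection `W`, the squared-distance loss, and every name below are assumptions chosen only to make the idea concrete. The real system uses a graph neural network and an LLM.

```python
# Toy illustration only: a linear stand-in for the graph-to-prompt interface.
# All names (soft_prompt, optimize_poisoned_node, W, target) are hypothetical.

def mat_vec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def soft_prompt(nodes, W):
    """Assumed encoder: mean-pool node embeddings, then project with W."""
    n, d = len(nodes), len(nodes[0])
    pooled = [sum(node[j] for node in nodes) / n for j in range(d)]
    return mat_vec(W, pooled)

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def optimize_poisoned_node(nodes, poison_idx, W, target, lr=0.5, steps=200):
    """Gradient descent on one node embedding to pull the soft prompt toward target.

    Only the node at poison_idx is updated; all other nodes stay fixed.
    """
    nodes = [list(v) for v in nodes]  # copy so the caller's data is untouched
    n, d, k = len(nodes), len(nodes[0]), len(W)
    for _ in range(steps):
        prompt = soft_prompt(nodes, W)
        residual = [p - t for p, t in zip(prompt, target)]
        # d/dh ||W * mean(nodes) - target||^2 = (2 / n) * W^T * residual
        grad = [(2.0 / n) * sum(W[i][j] * residual[i] for i in range(k))
                for j in range(d)]
        nodes[poison_idx] = [h - lr * g for h, g in zip(nodes[poison_idx], grad)]
    return nodes

# Two clean nodes plus one attacker-controlled node (index 2).
clean_nodes = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]]
W = [[1.0, 0.5], [0.0, 1.0]]
target = [-1.0, 2.0]  # stand-in for an adversarial target embedding

before = sq_dist(soft_prompt(clean_nodes, W), target)
poisoned = optimize_poisoned_node(clean_nodes, poison_idx=2, W=W, target=target)
after = sq_dist(soft_prompt(poisoned, W), target)
print(f"distance to target before: {before:.4f}, after: {after:.2e}")
```

In this simplified setting a single controllable node is enough to move the pooled representation onto the target. The abstract's third stage, replacing the optimized embedding with fluent node attribute text, is not modeled here.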