Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

📅 2025-11-23

📈 Citations: 0

✨ Influential: 0

career value

159K/year

🤖 AI Summary

Large language models (LLMs) are vulnerable to class-directed prompt injection attacks in text classification tasks—e.g., sentiment analysis—where adversaries exploit the model’s reliance on label semantics (e.g., “positive”/“negative”) to induce targeted misclassifications. To address this, we propose a lightweight, model-agnostic, and retraining-free defense: semantic label transformation, which replaces original labels with semantically unrelated or weakly related alias labels (e.g., “blue”/“yellow”), thereby severing the semantic linkage between adversarial prompts and model outputs. This is the first approach to treat label semantics themselves as a defense mechanism. We construct semantic-aligned and semantic-misaligned label mappings via few-shot learning and enhance robustness through linguistic analysis. Evaluated across nine mainstream LLMs, our method substantially restores classification accuracy under attack, consistently outperforming the undefended baseline across most configurations.

Technology Category

Application Category

📝 Abstract

Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.

Problem

Research questions and friction points this paper is trying to address.

Defends against prompt injection attacks in LLM sentiment classification

Conceals true labels with disguised aliases to prevent adversarial overrides

Evaluates lightweight defense across multiple models without retraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Label Disguise Defense conceals true labels with aliases

Model learns alias mappings through few-shot demonstrations

Semantically aligned alias labels enhance robustness against attacks

🔎 Similar Papers

Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization