🤖 AI Summary
Large language models (LLMs) face severe threats from prompt injection attacks, yet existing defenses struggle to simultaneously achieve generalizability and interpretability. Method: We propose DMPI-PMHFE, a dual-channel fusion detection framework that jointly encodes semantic features (via DeBERTa-v3-base) and structural features (derived from attack-specific syntactic patterns, punctuation anomalies, and control-word heuristics) at the feature level—ensuring both robustness and transparency. Contribution/Results: On multiple benchmark datasets, DMPI-PMHFE achieves an F1-score of 98.7%, significantly outperforming state-of-the-art methods. In real-world deployment across mainstream LLMs—including GLM-4, LLaMA-3, Qwen2.5, and GPT-4o—it reduces average attack success rates by 92.4%. The framework demonstrates cross-model robustness, low computational overhead, and practical suitability for real-time, production-grade defense.
📝 Abstract
With the widespread adoption of Large Language Models (LLMs), prompt injection attacks have emerged as a significant security threat. Existing defense mechanisms often face critical trade-offs between effectiveness and generalizability. This highlights the urgent need for efficient prompt injection detection methods that are applicable across a wide range of LLMs. To address this challenge, we propose DMPI-PMHFE, a dual-channel feature fusion detection framework. It integrates a pretrained language model with heuristic feature engineering to detect prompt injection attacks. Specifically, the framework employs DeBERTa-v3-base as a feature extractor to transform input text into semantic vectors enriched with contextual information. In parallel, we design heuristic rules based on known attack patterns to extract explicit structural features commonly observed in attacks. Features from both channels are subsequently fused and passed through a fully connected neural network to produce the final prediction. This dual-channel approach mitigates the limitations of relying only on DeBERTa to extract features. Experimental results on diverse benchmark datasets demonstrate that DMPI-PMHFE outperforms existing methods in terms of accuracy, recall, and F1-score. Furthermore, when deployed actually, it significantly reduces attack success rates across mainstream LLMs, including GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.