Securing Large Language Models (LLMs) from Prompt Injection Attacks

📅 2025-12-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) remain vulnerable to prompt injection attacks, posing critical security risks. Method: This paper proposes a task-specific fine-tuning–based defense, enhancing the HOUYI attack framework with a customized fitness scoring function, semantics-aware mutation strategies, and localized testing to rigorously evaluate robustness against multilingual and code-based adversarial interference. We apply the JATMO method to fine-tune LLaMA-2-7B, Qwen variants, and GPT-3.5-Turbo, and employ genetic algorithms to generate high-quality adversarial prompts. Results: Fine-tuning significantly reduces attack success rates but fails to eliminate vulnerabilities entirely; a fundamental trade-off between generation quality and security persists. Crucially, this work empirically reveals inherent limitations of fine-tuning–only defenses, confirms residual risks under multimodal perturbations, and provides foundational evidence supporting a layered defense architecture integrating fine-tuning, detection, and filtering.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks. These attacks leverage the model's instruction-following ability to make it perform malicious tasks. Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions. In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts. We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness. We fine-tuned LLaMA 2-7B, Qwen1.5-4B, and Qwen1.5-0.5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.5-Turbo baseline. Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses. We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility. Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.
Problem

Research questions and friction points this paper is trying to address.

Evaluates JATMO's robustness against genetic prompt injection attacks
Assesses trade-off between model generation quality and injection vulnerability
Highlights limitations of fine-tuning defenses for layered security strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

JATMO fine-tunes base models for single tasks
HOUYI genetic attack framework mutates adversarial prompts
Custom fitness scoring and mutation logic assess defenses
🔎 Similar Papers
No similar papers found.