AI Summary
This study addresses the vulnerability of large language models (LLMs) to prompt-injection and jailbreaking attacks in deployment scenarios, which pose significant security risks. We construct a large-scale, human-annotated dataset of such attacks and present the first systematic evaluation of multiple open-source LLMs' susceptibility to them. Our analysis reveals notable differences in models' safety response behaviors, including tendencies toward silence or refusal, which we attribute to internal architectural mechanisms. Furthermore, we evaluate various lightweight, inference-time defense strategies that require no model retraining. While these defenses effectively mitigate simple attacks, they remain susceptible to evasion when confronted with long-context or high-complexity prompts, highlighting critical limitations in current defensive approaches.
Abstract
Large Language Models (LLMs) are widely deployed in real-world systems. Given their broad applicability, prompt engineering has become an efficient way for resource-constrained organizations to adapt LLMs to their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks, making the analysis of this risk a critical security requirement. This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. We observe significant behavioural variation across models, including refusal responses and complete silent non-responsiveness triggered by internal safety mechanisms. Furthermore, we evaluate several lightweight, inference-time defence mechanisms that operate as filters, requiring no retraining or GPU-intensive fine-tuning. Although these defences mitigate straightforward attacks, they are consistently bypassed by long, reasoning-heavy prompts.