Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the robustness of 14 mainstream open-source large language models (LLMs) against five categories of prompt injection attacks, revealing significant vulnerabilities in generating harmful content and leaking sensitive information. Existing evaluation metrics inadequately capture the inherent uncertainty of such attacks. Method: We propose Attack Success Probability (ASP)—a fine-grained, probabilistic metric—to quantify attack effectiveness. Our methodology integrates a multi-category prompt injection template library, response semantic parsing, a probabilistic assessment framework, and a human-in-the-loop verification mechanism. Results: Experiments show that hypnotism attacks achieve ASP ≈ 90% on models including StableLM2 and Mistral; ignore-prefix attacks attain an average ASP > 60% across all 14 models, demonstrating strong cross-model generalizability. This work establishes a novel, reproducible benchmark and evaluation paradigm for LLM security assessment.
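The paper defines ASP precisely; the sketch below only illustrates the general idea of a probabilistic success metric. It assumes (as an illustration, not the paper's exact rule) that each attacked response is labeled `success`, `uncertain`, or `failure`, and that ambiguous responses receive a configurable partial weight rather than being discarded as a plain success rate would do.

```python
def attack_success_probability(outcomes, uncertain_weight=0.5):
    """Estimate attack effectiveness over a list of labeled responses.

    Each outcome is one of 'success', 'uncertain', or 'failure'.
    Unlike a binary attack-success rate, ambiguous responses
    contribute partial credit (uncertain_weight) to the score.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    score = sum(
        1.0 if o == "success" else uncertain_weight if o == "uncertain" else 0.0
        for o in outcomes
    )
    return score / len(outcomes)


# Example: 2 successes, 1 uncertain, 1 failure over 4 probes
print(attack_success_probability(["success", "uncertain", "failure", "success"]))
# → 0.625
```

With `uncertain_weight=0`, this reduces to the conventional attack success rate the paper argues is too coarse.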

📝 Abstract
Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to different prompt-based attacks, generating harmful content or sensitive information. Both closed-source and open-source LLMs are underinvestigated for these attacks. This paper studies effective prompt injection attacks against the 14 most popular open-source LLMs on five attack benchmarks. Current metrics only consider successful attacks, whereas our proposed Attack Success Probability (ASP) also captures uncertainty in the model's response, reflecting ambiguity in attack feasibility. By comprehensively analyzing the effectiveness of prompt injection attacks, we propose a simple and effective hypnotism attack; results show that this attack causes aligned language models, including StableLM2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around 90% ASP. They also indicate that our ignore-prefix attacks can break all 14 open-source LLMs, achieving over 60% ASP on a multi-categorical dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.
Problem

Research questions and friction points this paper is trying to address.

Investigating prompt injection attacks on open-source LLMs
Proposing Attack Success Probability (ASP) metric
Assessing vulnerability of popular LLMs to hypnotism attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Attack Success Probability (ASP) metric
Introduces a simple, effective hypnotism attack
Develops ignore-prefix attacks that break all 14 tested LLMs