Safety and Security Analysis of Large Language Models: Risk Profile and Harm Potential

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the vulnerability of nine mainstream large language models (LLMs) to adversarial prompts across 24 security threat categories—including violence, illegal content, hazardous code, and cybersecurity attacks. To address the lack of standardized, quantitative safety assessment, we propose the Risk Severity Index (RSI) and introduce the first scalable, quantifiable, cross-model security evaluation framework, integrating multidimensional automated testing with human-in-the-loop validation. Empirical results reveal pervasive safety filtering deficiencies in current LLMs—particularly among open-weight and rapidly iterated models—and demonstrate that RSI effectively characterizes risk gradients and exposes weaknesses in alignment mechanisms. Our work establishes an agile measurement paradigm for dynamically evolving LLM security threats, underscoring the urgent need for strengthened alignment techniques, rigorous deployment governance, and collaborative, multi-stakeholder security stewardship.

📝 Abstract
While the widespread deployment of Large Language Models (LLMs) holds great potential for society, their vulnerability to adversarial manipulation and exploitation can pose serious safety, security, and ethical risks. As new threats continue to emerge, it becomes critically necessary to assess the landscape of LLMs' safety and security against evolving adversarial prompt techniques. To understand the behavior of LLMs, this research provides an empirical analysis and risk profile of nine prominent LLMs: Claude Opus 4, DeepSeek V3 (both open-source and online), Gemini 2.5 Flash, GPT-4o, Grok 3, Llama 4 Scout, Mistral 7B, and Qwen 3 1.7B, against 24 different security and safety categories. These LLMs are evaluated on their ability to produce harmful responses to adversarially crafted prompts (the dataset has been made public) across a broad range of safety and security topics, such as the promotion of violent criminal behavior, the promotion of non-violent criminal activity, societal harms related to safety, illegal sexual content, dangerous code generation, and cybersecurity threats beyond code. Our study introduces the Risk Severity Index (RSI), an agile and scalable evaluation score, to quantify and compare the security posture of LLMs and to build their risk profiles. As the LLM development landscape progresses, the RSI is intended to be a valuable metric for comparing the risks of LLMs across evolving threats. This research finds widespread vulnerabilities in the safety filters of the LLMs tested and highlights the urgent need for stronger alignment, responsible deployment practices, and model governance, particularly for open-access and rapidly iterated models.
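
This page does not reproduce the paper's RSI formula. As a minimal sketch of how such a score could be computed, assuming the RSI is a severity-weighted average of per-category harmful-response rates scaled to 0–100 (the category names, severity weights, and counts below are hypothetical illustrations, not the paper's definition or data):

```python
# Sketch of a Risk Severity Index-style score (assumed design, not the
# paper's published formula): a severity-weighted average of per-category
# harmful-response rates, scaled to 0-100. Higher = riskier model.

from dataclasses import dataclass

@dataclass
class CategoryResult:
    name: str        # e.g. one of the paper's 24 safety/security categories
    harmful: int     # adversarial prompts that elicited a harmful response
    total: int       # adversarial prompts issued in this category
    severity: float  # assumed per-category severity weight in (0, 1]

def risk_severity_index(results: list[CategoryResult]) -> float:
    """Weighted average of harmful-response rates, scaled to 0-100."""
    weighted_harm = sum(r.severity * (r.harmful / r.total) for r in results)
    total_weight = sum(r.severity for r in results)
    return 100.0 * weighted_harm / total_weight

# Hypothetical example: scoring one model on two of the 24 categories.
model_results = [
    CategoryResult("violent criminal behavior", harmful=3, total=50, severity=1.0),
    CategoryResult("dangerous code generation", harmful=12, total=50, severity=0.8),
]
print(f"RSI: {risk_severity_index(model_results):.1f}")
```

Under this assumed design, a weighted average (rather than a raw count) keeps scores comparable across models even if the per-category prompt counts differ, which fits the paper's stated goal of a scalable, cross-model comparison.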
Problem

Research questions and friction points this paper is trying to address.

Analyzing LLM vulnerabilities to adversarial manipulation and exploitation
Evaluating LLM safety risks across 24 security categories
Quantifying risk severity through empirical analysis of nine prominent LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical analysis of nine LLMs
Risk Severity Index evaluation metric
Testing across 24 safety and security categories
Charankumar Akiri
Department of Computer Science, Tennessee Tech University, Cookeville, TN, USA
Harrison Simpson
Department of Computer Science, Tennessee Tech University, Cookeville, TN, USA
Kshitiz Aryal
School of Interdisciplinary Informatics, University of Nebraska at Omaha, Omaha, NE, USA
Aarav Khanna
BASIS Phoenix, Phoenix, AZ, USA
Maanak Gupta
Associate Chair and Associate Professor of Computer Science, Tennessee Tech University
Cyber Security · AI for Cybersecurity · Security of AI