Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation

📅 2025-03-09

📈 Citations: 0

✨ Influential: 0

career value

227K/year

🤖 AI Summary

Small language models (SLMs) are increasingly deployed on edge devices, yet their vulnerability to jailbreaking attacks remains systematically unassessed. Method: We conduct the first large-scale empirical evaluation of 63 SLMs against eight state-of-the-art jailbreaking techniques, employing a multidimensional adversarial benchmarking framework and cross-model-family attribution analysis. Contribution/Results: We find an average attack success rate of 47.6%, with 38.1% of models failing even on direct harmful queries. Our analysis identifies four key determinants of robustness—model scale, architecture, training data, and optimization techniques—and reveals “intrinsic safety awareness” as a critical defense mechanism. We propose a safety-first SLM design paradigm and demonstrate that three mainstream prompt-level defenses exhibit significant limitations. This work establishes foundational theoretical insights and practical guidelines for building trustworthy SLM ecosystems.

Technology Category

Application Category

📝 Abstract

Small language models (SLMs) have emerged as promising alternatives to large language models (LLMs) due to their low computational demands, enhanced privacy guarantees and comparable performance in specific domains through light-weight fine-tuning. Deploying SLMs on edge devices, such as smartphones and smart vehicles, has become a growing trend. However, the security implications of SLMs have received less attention than LLMs, particularly regarding jailbreak attacks, which is recognized as one of the top threats of LLMs by the OWASP. In this paper, we conduct the first large-scale empirical study of SLMs' vulnerabilities to jailbreak attacks. Through systematically evaluation on 63 SLMs from 15 mainstream SLM families against 8 state-of-the-art jailbreak methods, we demonstrate that 47.6% of evaluated SLMs show high susceptibility to jailbreak attacks (ASR>40%) and 38.1% of them can not even resist direct harmful query (ASR>50%). We further analyze the reasons behind the vulnerabilities and identify four key factors: model size, model architecture, training datasets and training techniques. Moreover, we assess the effectiveness of three prompt-level defense methods and find that none of them achieve perfect performance, with detection accuracy varying across different SLMs and attack methods. Notably, we point out that the inherent security awareness play a critical role in SLM security, and models with strong security awareness could timely terminate unsafe response with little reminder. Building upon the findings, we highlight the urgent need for security-by-design approaches in SLM development and provide valuable insights for building more trustworthy SLM ecosystem.

Problem

Research questions and friction points this paper is trying to address.

Evaluates vulnerabilities of small language models to jailbreak attacks.

Identifies key factors influencing SLM susceptibility to attacks.

Assesses effectiveness of prompt-level defense methods against attacks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale study on SLM jailbreak vulnerabilities

Evaluated 63 SLMs against 8 jailbreak methods

Identified key factors affecting SLM security

🔎 Similar Papers

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance