🤖 AI Summary
This work addresses the security vulnerabilities of large language models (LLMs) in low-resource languages (LRLs), which stem from scarce training data and imbalanced evaluation resources. We propose a scalable, automated framework for multilingual security vulnerability assessment that integrates cross-lingual adversarial sample generation, response consistency scoring, and human-in-the-loop calibration, enabling systematic safety evaluation of six mainstream LLMs across eight languages, six of them low-resource. Experimental results show that LRL security weaknesses arise primarily from degraded model performance rather than from failures of safety alignment, and the automated assessments agree with human judgments in over 85% of cases. Crucially, this study is the first to demonstrate that LRL security risks are fundamentally non-adversarial in nature: they are rooted in generalization failures rather than targeted jailbreaking. The framework establishes a reproducible, extensible methodological foundation for multilingual safety evaluation, advancing both empirical rigor and practical applicability in LRL security research.
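The reported 85% figure is a simple agreement rate between automated and human safety labels. As a minimal sketch of how such a validation number can be computed (function name, label vocabulary, and data below are illustrative assumptions, not from the paper):

```python
def agreement_rate(auto_labels, human_labels):
    """Fraction of samples where the automated label matches the human label."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)

# Illustrative example: 6 of 7 labels agree, i.e. ~85.7% agreement.
auto = ["safe", "unsafe", "safe", "safe", "unsafe", "safe", "safe"]
human = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe", "safe"]
print(round(agreement_rate(auto, human), 3))  # 0.857
```

In practice such a comparison would be run per language, since the paper validates the framework against human evaluation in two languages.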
📝 Abstract
Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. Although they undergo safety training to prevent them from answering harmful or illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRLs). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using our framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the assessments generated by our automated framework through human evaluation in two languages, demonstrating that the framework's results align with human judgments in most cases. Our findings reveal vulnerabilities in LRLs; however, these may pose minimal risk, as they often stem from the model's poor performance, which results in incoherent responses.