🤖 AI Summary
This work addresses the English-centric bias of existing multimodal safety evaluations, which often overlook risks rooted in local cultural contexts. To bridge this gap, we introduce KSAFE-MM, the first multimodal safety benchmark tailored to Korean culture, which combines linguistic contextualization, culturally grounded visual queries, and jailbreak-style textual prompts to assess both general and culture-specific vulnerabilities. Experiments on twelve state-of-the-art multimodal large language models (MLLMs) show that models are more vulnerable to culturally grounded attacks than to generic ones, and that jailbreaking strategies sharply raise the attack success rate (ASR): ProgramExecution reaches up to 74.2% ASR, compared with 13.4% for standard queries. The findings also reveal a systematic trade-off between safety and over-refusal, in which models with low ASR tend to refuse benign queries excessively.
📝 Abstract
Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from two major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. KSAFE-MM-C pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks that combine cultural visual cues with malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and find that models are more vulnerable to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify the attack success rate (ASR), with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal: models that achieve low ASR tend to refuse benign queries excessively. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.
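The two quantities behind the reported trade-off, attack success rate and over-refusal, can be illustrated with a minimal sketch. The abstract does not specify how KSAFE-MM judges responses, so the `Judgment` schema, the function names, and the toy records below are all hypothetical and only show one common way such rates are computed. The last lines use the two ASR figures quoted in the abstract to work out how large the reported amplification is.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judged model response (hypothetical schema, not from the paper)."""
    is_harmful_query: bool  # True for attack prompts, False for benign ones
    complied: bool          # True if the model answered instead of refusing
    harmful_output: bool    # True if a judge flagged the answer as unsafe


def attack_success_rate(judgments):
    """Percentage of harmful queries that produced an unsafe answer."""
    attacks = [j for j in judgments if j.is_harmful_query]
    if not attacks:
        return 0.0
    hits = sum(j.complied and j.harmful_output for j in attacks)
    return 100.0 * hits / len(attacks)


def over_refusal_rate(judgments):
    """Percentage of benign queries that the model refused."""
    benign = [j for j in judgments if not j.is_harmful_query]
    if not benign:
        return 0.0
    refused = sum(not j.complied for j in benign)
    return 100.0 * refused / len(benign)


# Toy records for illustration only; these are not benchmark data.
toy = [
    Judgment(True, True, True),     # attack succeeded
    Judgment(True, False, False),   # attack refused
    Judgment(True, False, False),   # attack refused
    Judgment(True, True, False),    # answered, but safely
    Judgment(False, True, False),   # benign query answered
    Judgment(False, False, False),  # benign query refused (over-refusal)
    Judgment(False, False, False),  # benign query refused (over-refusal)
    Judgment(False, True, False),   # benign query answered
]

toy_asr = attack_success_rate(toy)
toy_over_refusal = over_refusal_rate(toy)

# Figures quoted in the abstract: ProgramExecution vs. standard queries.
amplification = round(74.2 / 13.4, 2)  # relative increase in ASR
gap_points = round(74.2 - 13.4, 1)     # absolute increase in percentage points

print(toy_asr, toy_over_refusal, amplification, gap_points)
```

Under these assumptions, a model that refuses more aggressively lowers the first rate while raising the second, which is the tension the abstract describes. The quoted figures correspond to roughly a 5.5-fold increase in ASR, or about 61 percentage points.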