🤖 AI Summary
Existing multimodal large language model (MLLM) safety benchmarks suffer from narrow attack scenarios, a lack of standardized defense evaluation, and non-reproducible tooling. To address these limitations, this paper introduces the first unified multimodal jailbreaking benchmark for comprehensive attack-defense evaluation. It integrates 13 attack methods, 15 defense strategies, and a high-quality dataset spanning nine critical risk domains. The authors propose a three-dimensional safety assessment framework, measuring harmfulness, intent consistency, and response comprehensiveness, to jointly quantify safety and utility. They also design novel multimodal safety data construction techniques and a modular attack-defense integration framework, and release an open-source, reproducible evaluation platform supporting systematic comparison of both open- and closed-source MLLMs. Extensive experiments across 10 open-source and 8 closed-source models reveal widespread vulnerabilities, advancing standardization and reproducibility in multimodal safety evaluation.
📝 Abstract
Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, graded on a multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak attacks. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.