🤖 AI Summary
Existing safety evaluations focus narrowly on unimodal (text-only or image-only) settings, overlooking latent risks arising from vision-language joint understanding and failing to distinguish between overtly harmful content and ambiguous edge cases.
Method: We propose VLSU, the first fine-grained evaluation framework for multimodal joint safety understanding, comprising 17 distinct safety patterns and 15 harm categories across 8,187 real-world image-text pairs, with severity grading and dedicated compositional reasoning tests.
Results: Experiments on 11 state-of-the-art models reveal a sharp accuracy drop—from over 90% to 20–55%—on tasks requiring cross-modal reasoning. Critically, 34% of errors occur despite correct unimodal predictions, stemming from failed multimodal composition and exposing two previously uncharacterized challenges: the “compositional reasoning gap” and the “alignment dilemma” in multimodal safety.
📝 Abstract
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve over 90% accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating a lack of compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing unsafe content, with the refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
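The compositional failure mode described above — a joint image-text misclassification that occurs even though each modality was classified correctly on its own — can be quantified as a simple ratio over joint errors. A minimal Python sketch, where the field names and toy samples are illustrative assumptions and not the paper's actual data schema:

```python
def compositional_error_rate(samples):
    """Fraction of joint safety misclassifications where BOTH unimodal
    (image-only and text-only) labels were nevertheless predicted correctly.

    samples: list of dicts with predicted and gold safety labels
    (hypothetical keys: pred_image/gold_image, pred_text/gold_text,
    pred_joint/gold_joint).
    """
    joint_errors = [s for s in samples if s["pred_joint"] != s["gold_joint"]]
    if not joint_errors:
        return 0.0
    compositional = [
        s for s in joint_errors
        if s["pred_image"] == s["gold_image"]
        and s["pred_text"] == s["gold_text"]
    ]
    return len(compositional) / len(joint_errors)

samples = [
    # Both modalities benign in isolation, harmful in combination:
    # unimodal predictions correct, joint prediction wrong -> compositional error.
    {"pred_image": "safe", "gold_image": "safe",
     "pred_text": "safe", "gold_text": "safe",
     "pred_joint": "safe", "gold_joint": "unsafe"},
    # Joint prediction correct -> not an error at all.
    {"pred_image": "safe", "gold_image": "safe",
     "pred_text": "unsafe", "gold_text": "unsafe",
     "pred_joint": "unsafe", "gold_joint": "unsafe"},
    # Unimodal prediction already wrong -> joint error, but not compositional.
    {"pred_image": "unsafe", "gold_image": "safe",
     "pred_text": "safe", "gold_text": "safe",
     "pred_joint": "unsafe", "gold_joint": "safe"},
]
print(compositional_error_rate(samples))  # 0.5
```

On the paper's benchmark this ratio is reported as 34%: a third of joint failures happen with both unimodal signals already handled correctly.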