MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing vision-language models struggle to align image-text pairs that are semantically or visually very similar yet differ in critical details, particularly in safety- and culture-sensitive contexts, where such misjudgments can have real-world consequences. To this end, the paper introduces the first fine-grained minimal-pair benchmarks for safety (MiS) and culture (MiC), comprising contrastive image-text pairs that differ only by subtle, contextually crucial variations, to systematically evaluate the cross-modal alignment capabilities of state-of-the-art models. Experimental results reveal that these models are significantly better at confirming correct matches than at rejecting incorrect ones, and that they perform more robustly in image-to-text tasks than in the reverse direction, highlighting a notable deficiency in fine-grained semantic understanding.
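Concretely, one MiS/MiC sample bundles two minimally differing images with two minimally differing captions. The record below is a sketch of that structure; the field names and example captions are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical shape of one minimal-pair sample (field names and
# example captions are illustrative; the released schema may differ).
from dataclasses import dataclass

@dataclass
class MinimalPair:
    image_a: str    # e.g. the safe scenario in MiS
    image_b: str    # minimally differing image, e.g. the unsafe scenario
    caption_a: str  # caption matching image_a
    caption_b: str  # minimally differing caption matching image_b

sample = MinimalPair(
    image_a="mis/0001_safe.jpg",
    image_b="mis/0001_unsafe.jpg",
    caption_a="A worker climbs a ladder while wearing a safety harness.",
    caption_b="A worker climbs a ladder without a safety harness.",
)
```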

📝 Abstract
Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.
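The matching tasks in the abstract can be pictured as scoring the 2x2 grid of image-caption similarities per sample. The sketch below uses open_clip as a stand-in contrastive scorer over the MinimalPair record sketched earlier; the accuracy definitions are assumptions for illustration, not the paper's released evaluation code, and this single scorer does not reproduce the four VLMs the paper evaluates.

```python
# A minimal sketch of the 2x2 minimal-pair evaluation, using open_clip
# as a stand-in scorer. Accuracy definitions are illustrative assumptions;
# the paper's exact protocol and models may differ.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def similarity_matrix(image_paths, captions):
    """Return sims where sims[i, j] = cosine similarity of image i and caption j."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
    texts = tokenizer(captions)
    with torch.no_grad():
        img = model.encode_image(images)
        txt = model.encode_text(texts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return img @ txt.T

def evaluate(pairs):
    """pairs: iterable of MinimalPair records as sketched above.
    Image-to-text: pick the right caption for each image (row-wise argmax).
    Text-to-image: pick the right image for each caption (column-wise argmax)."""
    i2t = t2i = total = 0
    for s in pairs:
        sims = similarity_matrix((s.image_a, s.image_b),
                                 (s.caption_a, s.caption_b))
        i2t += int(sims[0].argmax() == 0) + int(sims[1].argmax() == 1)
        t2i += int(sims[:, 0].argmax() == 0) + int(sims[:, 1].argmax() == 1)
        total += 2
    return {"image_to_text_acc": i2t / total, "text_to_image_acc": t2i / total}
```

Note that an argmax over the grid only captures the selection tasks; rejecting a mismatched pair outright would additionally require a calibrated threshold on the off-diagonal scores, which is the regime where the paper reports models struggle most.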
Problem

Research questions and friction points this paper is trying to address.

fine-grained alignment
vision-language models
safety
culture
minimal pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained alignment
minimal-pairs benchmark
vision-language models
cross-modal grounding
cultural proxies