MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods for mutual exclusivity (ME) bias in vision-language models (VLMs) lack fine-grained, high-fidelity assessment. Method: We introduce MEBench, the first dedicated ME benchmark for VLMs, by adapting the ME mechanism studied in child word learning to VLM evaluation. Our approach integrates spatial reasoning into a controllable synthetic scene generation pipeline and proposes multi-dimensional ME-aware metrics, including referential exclusivity accuracy and a spatial consistency score. Contributions/Results: (1) the first reproducible, quantitative ME-bias evaluation framework supporting both zero-shot and fine-tuned VLMs; (2) evidence that state-of-the-art VLMs exhibit strong ME tendencies yet lack semantic-spatial co-reasoning capability; (3) public release of MEBench to advance research on cognitive bias diagnosis and controllable learning in VLMs.
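The summary names two metrics but gives no formulas, so the following is only a minimal sketch of how referential exclusivity accuracy and a spatial consistency score might be computed; the `Trial` fields (`predicted_object`, `novel_object`, `predicted_region`, `gold_region`) are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of MEBench-style metrics (field names assumed, not from
# the paper). Referential exclusivity accuracy: did the model map the novel
# word to the unfamiliar object? Spatial consistency: does the model's spatial
# description of the referent match the scene annotation?

from dataclasses import dataclass

@dataclass
class Trial:
    predicted_object: str   # object the VLM picked for the novel word
    novel_object: str       # the unfamiliar object in the scene (ME target)
    predicted_region: str   # spatial relation named by the model
    gold_region: str        # annotated spatial relation of the target object

def referential_exclusivity_accuracy(trials: list[Trial]) -> float:
    """Fraction of trials where the novel word is mapped to the novel object."""
    hits = sum(t.predicted_object == t.novel_object for t in trials)
    return hits / len(trials) if trials else 0.0

def spatial_consistency_score(trials: list[Trial]) -> float:
    """Fraction of trials where the predicted spatial relation matches the annotation."""
    hits = sum(t.predicted_region == t.gold_region for t in trials)
    return hits / len(trials) if trials else 0.0

# Toy usage with two invented trials:
trials = [
    Trial("dax_object", "dax_object", "left of the mug", "left of the mug"),
    Trial("mug", "dax_object", "on the table", "left of the mug"),
]
print(referential_exclusivity_accuracy(trials))  # 0.5
print(spatial_consistency_score(trials))         # 0.5
```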

📝 Abstract
This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.
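The abstract mentions a flexible, scalable scene-generation pipeline but gives no implementation details. Below is a minimal sketch of what a controllable pipeline of this kind could look like, assuming scenes mix familiar objects with one novel object placed in named spatial slots; all object names, slot names, and the output dictionary layout are invented for illustration.

```python
# Hypothetical, controllable scene-generation sketch in the spirit of MEBench
# (not the paper's implementation): each scene pairs familiar objects with one
# novel object, assigns spatial slots, and records the nonce word that should
# resolve to the novel object under mutual-exclusivity reasoning.

import random

FAMILIAR_OBJECTS = ["mug", "book", "spoon", "ball"]   # assumed vocabulary
NOVEL_OBJECTS = ["blicket", "dax", "toma", "wug"]     # assumed nonce objects
SPATIAL_SLOTS = ["left", "center", "right"]           # assumed layout slots

def generate_scene(seed: int) -> dict:
    """Return one annotated scene: slot-to-object placements plus the ME query."""
    rng = random.Random(seed)
    familiar = rng.sample(FAMILIAR_OBJECTS, 2)
    novel = rng.choice(NOVEL_OBJECTS)
    objects = familiar + [novel]
    rng.shuffle(objects)
    placements = dict(zip(SPATIAL_SLOTS, objects))
    return {
        "placements": placements,   # e.g. {"left": "mug", "center": "dax", ...}
        "query_word": novel,        # nonce word presented to the VLM
        "target_slot": next(s for s, o in placements.items() if o == novel),
    }

# Reproducible batch of three scenes:
for s in (generate_scene(seed) for seed in range(3)):
    print(s)
```

A seeded generator like this makes every scene reproducible, which is the property the abstract emphasizes for controlled experimentation.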
Problem

Research questions and friction points this paper is trying to address.

Evaluating mutual exclusivity bias in vision-language models
Incorporating spatial reasoning for realistic ME assessment
Developing a scalable data pipeline for diverse annotated scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for mutual exclusivity bias evaluation
Incorporates spatial reasoning for realistic settings
Flexible data generation pipeline for diverse scenes