MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods for mutual exclusivity (ME) bias in vision-language models (VLMs) lack fine-grained, high-fidelity assessment. Method: We introduce MEBench, the first dedicated ME benchmark for VLMs, by adapting the ME mechanism studied in child word learning to VLM evaluation. Our approach integrates spatial reasoning into a controllable synthetic scene generation pipeline and proposes multi-dimensional ME-aware metrics, including referential exclusivity accuracy and a spatial consistency score. Contributions/Results: (1) the first reproducible, quantitative ME-bias evaluation framework supporting both zero-shot and fine-tuned VLMs; (2) evidence that state-of-the-art VLMs exhibit strong ME tendencies yet lack semantic-spatial co-reasoning capability; (3) public release of MEBench to advance research on cognitive bias diagnosis and controllable learning in VLMs.
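The summary names two metrics but gives no formulas, so the following is only a minimal sketch of how referential exclusivity accuracy and a spatial consistency score might be computed; the `Trial` fields (`predicted_object`, `novel_object`, `predicted_region`, `gold_region`) are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of MEBench-style metrics (field names assumed, not from
# the paper). Referential exclusivity accuracy: did the model map the novel
# word to the unfamiliar object? Spatial consistency: does the model's spatial
# description of the referent match the scene annotation?

from dataclasses import dataclass

@dataclass
class Trial:
    predicted_object: str   # object the VLM picked for the novel word
    novel_object: str       # the unfamiliar object in the scene (ME target)
    predicted_region: str   # spatial relation named by the model
    gold_region: str        # annotated spatial relation of the target object

def referential_exclusivity_accuracy(trials: list[Trial]) -> float:
    """Fraction of trials where the novel word is mapped to the novel object."""
    hits = sum(t.predicted_object == t.novel_object for t in trials)
    return hits / len(trials) if trials else 0.0

def spatial_consistency_score(trials: list[Trial]) -> float:
    """Fraction of trials where the predicted spatial relation matches the annotation."""
    hits = sum(t.predicted_region == t.gold_region for t in trials)
    return hits / len(trials) if trials else 0.0

# Toy usage with two invented trials:
trials = [
    Trial("dax_object", "dax_object", "left of the mug", "left of the mug"),
    Trial("mug", "dax_object", "on the table", "left of the mug"),
]
print(referential_exclusivity_accuracy(trials))  # 0.5
print(spatial_consistency_score(trials))         # 0.5
```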

📝 Abstract
This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.
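The abstract mentions a flexible, scalable scene-generation pipeline but gives no implementation details. Below is a minimal sketch of what a controllable pipeline of this kind could look like, assuming scenes mix familiar objects with one novel object placed in named spatial slots; all object names, slot names, and the output dictionary layout are invented for illustration.

```python
# Hypothetical, controllable scene-generation sketch in the spirit of MEBench
# (not the paper's implementation): each scene pairs familiar objects with one
# novel object, assigns spatial slots, and records the nonce word that should
# resolve to the novel object under mutual-exclusivity reasoning.

import random

FAMILIAR_OBJECTS = ["mug", "book", "spoon", "ball"]   # assumed vocabulary
NOVEL_OBJECTS = ["blicket", "dax", "toma", "wug"]     # assumed nonce objects
SPATIAL_SLOTS = ["left", "center", "right"]           # assumed layout slots

def generate_scene(seed: int) -> dict:
    """Return one annotated scene: slot-to-object placements plus the ME query."""
    rng = random.Random(seed)
    familiar = rng.sample(FAMILIAR_OBJECTS, 2)
    novel = rng.choice(NOVEL_OBJECTS)
    objects = familiar + [novel]
    rng.shuffle(objects)
    placements = dict(zip(SPATIAL_SLOTS, objects))
    return {
        "placements": placements,   # e.g. {"left": "mug", "center": "dax", ...}
        "query_word": novel,        # nonce word presented to the VLM
        "target_slot": next(s for s, o in placements.items() if o == novel),
    }

# Reproducible batch of three scenes:
for s in (generate_scene(seed) for seed in range(3)):
    print(s)
```

A seeded generator like this makes every scene reproducible, which is the property the abstract emphasizes for controlled experimentation.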
Problem

Research questions and friction points this paper is trying to address.

Evaluating mutual exclusivity bias in vision-language models
Incorporating spatial reasoning for realistic ME assessment
Developing a scalable data pipeline for diverse annotated scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for mutual exclusivity bias evaluation
Incorporates spatial reasoning for realistic settings
Flexible data generation pipeline for diverse scenes