Disparities In Negation Understanding Across Languages In Vision-Language Models

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses affirmation bias in vision-language models, the tendency to prefer affirmative captions even when the correct description is negated, and asks whether the bias and its proposed fixes behave evenly across languages, particularly those written in non-Latin scripts. The authors construct the first human-verified multilingual benchmark for negation understanding, covering seven typologically diverse languages (English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish), and evaluate three models, CLIP, SigLIP, and MultiCLIP, along with the SpaceVLM negation-correction method. The results show systematic cross-lingual disparities in negation comprehension: standard CLIP performs at or below chance on non-Latin-script languages, MultiCLIP achieves the highest and most uniform accuracy, and SpaceVLM yields substantial gains for English, Greek, Spanish, and Tagalog but varies in effectiveness across typologically different languages. This variation indicates that linguistic properties such as morphology, script, and negation structure interact with model improvements, underscoring the role of linguistic diversity in fair model evaluation.

📝 Abstract

Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions ("X is present") even when the correct description contains negation ("no X"). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs (CLIP, SigLIP, and MultiCLIP), we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages, particularly English, Greek, Spanish, and Tagalog, while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.
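
To make the notion of affirmation bias concrete, the sketch below scores a toy two-way caption-selection task per language. It is an illustration under stated assumptions, not the paper's evaluation code: it assumes each item pairs one image with one affirmative and one negated caption, that a model yields one similarity score per caption, and that the higher score wins. The function name `evaluate`, the record fields, and every score in `toy_records` are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def evaluate(records):
    """Per-language accuracy and affirmation-bias rate.

    Each record is a dict with hypothetical keys:
      lang            - language code
      sim_affirmative - image-text similarity for the affirmative caption
      sim_negated     - image-text similarity for the negated caption
      correct         - "affirmative" or "negated"
    """
    stats = defaultdict(
        lambda: {"n": 0, "right": 0, "neg_items": 0, "neg_picked_aff": 0}
    )
    for r in records:
        s = stats[r["lang"]]
        # Assumption: ties are resolved in favour of the affirmative caption.
        if r["sim_affirmative"] >= r["sim_negated"]:
            picked = "affirmative"
        else:
            picked = "negated"
        s["n"] += 1
        s["right"] += picked == r["correct"]
        if r["correct"] == "negated":
            s["neg_items"] += 1
            s["neg_picked_aff"] += picked == "affirmative"

    results = {}
    for lang, s in stats.items():
        # Share of negated-correct items where the affirmative caption won.
        bias = s["neg_picked_aff"] / s["neg_items"] if s["neg_items"] else None
        results[lang] = {
            "accuracy": s["right"] / s["n"],
            "affirmation_bias": bias,
        }
    return results


# Made-up scores, for illustration only.
toy_records = [
    {"lang": "en", "sim_affirmative": 0.31, "sim_negated": 0.27, "correct": "negated"},
    {"lang": "en", "sim_affirmative": 0.22, "sim_negated": 0.30, "correct": "negated"},
    {"lang": "en", "sim_affirmative": 0.35, "sim_negated": 0.20, "correct": "affirmative"},
    {"lang": "el", "sim_affirmative": 0.28, "sim_negated": 0.29, "correct": "negated"},
    {"lang": "el", "sim_affirmative": 0.33, "sim_negated": 0.25, "correct": "affirmative"},
]

results = evaluate(toy_records)
for lang, metrics in results.items():
    print(lang, metrics)
```

Under this two-way setup, chance accuracy is 0.5, so a language whose accuracy sits at or below that level while its affirmation-bias rate is high would match the failure pattern the abstract describes. Reporting the two numbers separately per language keeps an overall accuracy figure from hiding which communities a correction actually helps.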

Problem

Research questions and friction points this paper is trying to address.

negation understanding

vision-language models

multilingual fairness

affirmation bias

cross-lingual disparity

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual benchmark

negation understanding

vision-language models