VAQUUM: Are Vague Quantifiers Grounded in Visual Data?

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision-language models (VLMs) interpret image-grounded quantity adjectives such as "some" and "many" in alignment with human judgment. To this end, the authors introduce VAQUUM, the first large-scale, human-annotated benchmark of this kind, comprising 1,089 images and 20,300 annotations, and design three evaluation protocols: binary classification, ranking, and generation-based matching. The work is the first to systematically uncover a dissociation in VLMs: although models exhibit basic numerical sensitivity, their performance on quantity judgment and generation tasks is only weakly correlated (r < 0.3), and cross-paradigm consistency falls substantially below human agreement (p < 0.001). Experiments span prominent VLMs, including BLIP-2, LLaVA, and Qwen-VL, under both zero-shot and fine-tuned settings, and confirm that these biases are robust. The findings challenge prevailing assumptions in quantitative semantic modeling and establish a new evaluation paradigm and empirical benchmark for assessing semantic groundedness in VLMs.

📝 Abstract
Vague quantifiers such as "a few" and "many" are influenced by many contextual factors, including how many objects are present in a given context. In this work, we evaluate the extent to which vision-and-language models (VLMs) are compatible with humans when producing or judging the appropriateness of vague quantifiers in visual contexts. We release a novel dataset, VAQUUM, containing 20,300 human ratings on quantified statements across a total of 1,089 images. Using this dataset, we compare human judgments and VLM predictions using three different evaluation methods. Our findings show that VLMs, like humans, are influenced by object counts in vague quantifier use. However, we find significant inconsistencies across models in different evaluation settings, suggesting that judging and producing vague quantifiers rely on two different processes.
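One way to picture the kind of human–model comparison the abstract describes is a rank-correlation check between human appropriateness ratings and model scores for the same quantified statements. The sketch below is purely illustrative: the ratings and scores are made-up numbers, not VAQUUM data, and the paper's actual evaluation methods may differ in detail.

```python
from statistics import mean

def rank(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank for positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: mean human ratings (1-5 scale) for "many" on five
# images, next to a model's invented scores for the same statements.
human = [4.8, 1.2, 3.5, 2.0, 4.1]
model = [0.91, 0.10, 0.30, 0.55, 0.80]
print(round(spearman(human, model), 3))  # → 0.9
```

A high correlation would indicate that the model orders images by quantifier appropriateness much as humans do; the paper's point is that such agreement in one evaluation setting need not carry over to the others.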
Problem

Research questions and friction points this paper is trying to address.

Evaluate how compatible VLMs are with humans when using vague quantifiers
Assess the appropriateness of vague quantifiers in visual contexts
Compare human judgments with VLM predictions across evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analysis of vague quantifier use in vision-and-language models
VAQUUM, a dataset of human-rated quantified statements over images
Three evaluation methods for comparing human and model quantifier judgments