🤖 AI Summary
Visual comparative reasoning, a core capability for multimodal models, remains substantially weaker in current vision-language models (VLMs) than in humans, particularly on foundational tasks such as counting, temporal ordering, and geometric/spatial comparison. To diagnose these systematic deficiencies, we introduce CompareBench, the first benchmark explicitly designed for visual comparative reasoning, comprising 1,000 question-answer pairs spanning four human-intuitive dimensions: quantity, time, geometry, and space. We further release two auxiliary datasets: TallyBench (for counting diagnostics) and HistCaps (for temporal understanding). Using a controlled-variable QA evaluation framework, we systematically assess leading closed-source (OpenAI, Gemini, Claude) and open-source (Qwen2.5-VL, Qwen3-VL) models. Results show consistent failures in temporal and spatial reasoning across all models, underscoring CompareBench's utility for diagnosing limitations and advancing robust, reliable multimodal reasoning.
📝 Abstract
We introduce CompareBench, a benchmark for evaluating visual comparison reasoning, a fundamental yet understudied skill, in vision-language models (VLMs). CompareBench consists of 1,000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2,000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (the Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often err on basic counting and geometric comparisons that are trivial for humans. These findings show that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
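The task split above (quantity 600, temporal 100, geometric 200, spatial 100) can be sketched as a small sanity-check plus a per-task accuracy tally of the kind such a benchmark's evaluation harness would report. This is a minimal illustrative sketch only; the field names and helper function are assumptions, not CompareBench's actual schema or code.

```python
# Hypothetical sketch of CompareBench's task composition (counts taken from
# the abstract) and a per-task accuracy tally. Names are illustrative
# assumptions, not the dataset's real schema.

TASK_SPLIT = {"quantity": 600, "temporal": 100, "geometric": 200, "spatial": 100}

def per_task_accuracy(results):
    """results: iterable of (task, is_correct) pairs -> {task: accuracy}."""
    totals, correct = {}, {}
    for task, ok in results:
        totals[task] = totals.get(task, 0) + 1
        correct[task] = correct.get(task, 0) + int(ok)
    return {t: correct[t] / totals[t] for t in totals}

# The four task counts must sum to the benchmark's 1,000 QA pairs.
assert sum(TASK_SPLIT.values()) == 1000
```

Reporting accuracy per task rather than a single aggregate is what makes the benchmark diagnostic: a model can score well overall while still failing the small temporal and spatial subsets.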