VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited capability of current vision-language models (VLMs) in comparative reasoning tasks that require discerning subtle inter-image differences, as well as the absence of fine-grained evaluation benchmarks reflecting real-world scenarios. To this end, we introduce VLM-SubtleBench, the first cross-domain, fine-grained benchmark for comparative reasoning, encompassing ten categories of subtle visual discrepancies and incorporating image pairs from specialized domains such as industrial inspection, aviation, and medical imaging. Through systematic evaluation of prominent open- and closed-source VLMs, our study reveals that existing models substantially underperform humans—particularly along dimensions of state, spatial relations, and viewpoint—and exhibit pronounced degradation in reasoning accuracy within professional contexts. These findings highlight critical bottlenecks in VLMs’ fine-grained visual understanding capabilities.

📝 Abstract
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action) and curates paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural-image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
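Concretely, the evaluation the abstract describes amounts to scoring a VLM's answer to a comparative question over each image pair, then aggregating accuracy per difference type and per domain. The paper does not publish a data schema or API, so everything below is a hypothetical sketch: the record fields (`image_a`, `image_b`, `difference_type`, `domain`, `question`, `answer`), the `query_vlm` placeholder, and the exact-match scoring rule are all assumptions, not the authors' released format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record layout; VLM-SubtleBench does not publish a schema.
@dataclass
class SubtlePair:
    image_a: str          # path to the first image
    image_b: str          # path to the near-identical second image
    difference_type: str  # one of the ten types, e.g. "State" or "Viewpoint"
    domain: str           # e.g. "industrial", "aerial", "medical"
    question: str         # comparative question about the pair
    answer: str           # ground-truth answer

def query_vlm(pair: SubtlePair) -> str:
    """Placeholder extension point: send both images plus the question
    to a multi-image VLM and return its textual answer."""
    raise NotImplementedError

def accuracy_by(pairs: list[SubtlePair], key: str) -> dict[str, float]:
    """Exact-match accuracy grouped by `difference_type` or `domain`
    (an assumed metric; the paper may use a different scoring rule)."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for pair in pairs:
        group = getattr(pair, key)
        totals[group] += 1
        prediction = query_vlm(pair).strip().lower()
        if prediction == pair.answer.strip().lower():
            hits[group] += 1
    return {g: hits[g] / totals[g] for g in totals}
```

Calling `accuracy_by(pairs, "difference_type")` would yield the per-type breakdown (e.g. State, Spatial, Viewpoint) where the paper reports the largest human-model gaps, while `accuracy_by(pairs, "domain")` would yield the per-domain view used to show degradation in professional contexts.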
Problem

Research questions and friction points this paper is trying to address.

subtle comparative reasoning
vision-language models
benchmark
visual differences
human-level reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

subtle comparative reasoning
vision-language models
benchmark
fine-grained visual differences
cross-domain evaluation