Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies systematic deficiencies of vision-language models (VLMs) on complex logical reasoning tasks, particularly causal and conditional inference. To diagnose these blindspots, the authors introduce LogicBench, a large-scale, multi-scenario logical reasoning benchmark covering nine logical categories across four application domains (images, videos, anomaly detection, and medical diagnostics). They further propose LogicCLIP, a logic-aware training framework that combines logic-aware data generation with a contrastive learning strategy built from coarse-grained alignment, a fine-grained multiple-choice objective, and a logical structure-aware objective. Extensive experiments show that LogicCLIP delivers substantial gains over state-of-the-art VLM baselines across all LogicBench domains while retaining, and often surpassing, competitive performance on general vision-language tasks such as visual question answering and cross-modal retrieval. The work establishes a rigorous evaluation standard, delivers a principled training methodology, and provides empirical evidence for advancing trustworthy logical reasoning in VLMs.

📝 Abstract
Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs' logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP's substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.
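
The abstract names three training objectives but gives no formulas. The following is a minimal PyTorch sketch of how such a combined objective could look; the function names, loss weights, temperature, and margin are illustrative assumptions, not the paper's published formulation.

```python
# Minimal sketch of a LogicCLIP-style combined objective (assumptions noted above).
import torch
import torch.nn.functional as F

def coarse_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP-style symmetric InfoNCE over a batch of matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multiple_choice_loss(img_emb, choice_embs, answer_idx, temperature=0.07):
    """Fine-grained supervision: pick the correct caption among K candidates."""
    img_emb = F.normalize(img_emb, dim=-1).unsqueeze(1)      # (B, 1, D)
    choice_embs = F.normalize(choice_embs, dim=-1)           # (B, K, D)
    logits = (img_emb * choice_embs).sum(-1) / temperature   # (B, K)
    return F.cross_entropy(logits, answer_idx)

def logic_structure_loss(img_emb, pos_emb, neg_emb, margin=0.2):
    """Structure-aware term: a caption whose logical connective was perturbed
    (e.g. cause and effect swapped) must score below the original by a margin."""
    img_emb = F.normalize(img_emb, dim=-1)
    sim_pos = (img_emb * F.normalize(pos_emb, dim=-1)).sum(-1)
    sim_neg = (img_emb * F.normalize(neg_emb, dim=-1)).sum(-1)
    return F.relu(margin - sim_pos + sim_neg).mean()

def logicclip_loss(img_emb, txt_emb, choice_embs, answer_idx, neg_emb,
                   w_mc=1.0, w_logic=1.0):
    """Weighted sum of the three objectives; weights are placeholders."""
    return (coarse_alignment_loss(img_emb, txt_emb)
            + w_mc * multiple_choice_loss(img_emb, choice_embs, answer_idx)
            + w_logic * logic_structure_loss(img_emb, txt_emb, neg_emb))
```

The margin term is what would separate this setup from vanilla CLIP training: the hard negatives differ from the positives only in logical structure, so the model cannot satisfy the objective through surface semantics alone.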
Problem

Research questions and friction points this paper is trying to address.

Diagnose logical blindspots in Vision-Language Models (VLMs)
Evaluate VLM performance across 9 logical categories and 4 scenarios (a zero-shot scoring sketch follows this list)
Propose LogicCLIP to enhance VLMs' logical sensitivity and comprehension
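
To make the evaluation setting concrete, here is a hypothetical sketch of scoring one LogicBench-style multiple-choice item with an off-the-shelf CLIP via Hugging Face transformers. The item text and image path are invented for illustration; LogicBench's actual data format is not specified in this summary.

```python
# Score one multiple-choice item by ranking candidate captions against an image.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
choices = [
    "The floor is wet because the pipe burst.",   # correct causal reading
    "The pipe burst because the floor is wet.",   # cause and effect swapped
    "The floor is wet although the pipe burst.",  # connective perturbed
]

inputs = processor(text=choices, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # (1, num_choices)
pred = logits_per_image.argmax(dim=-1).item()
print(f"Model picks choice {pred}: {choices[pred]}")
```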
Innovation

Methods, ideas, or system contributions that make the work stand out.

LogicBench: a benchmark of over 50,000 vision-language pairs for logical evaluation across 9 categories and 4 scenarios
LogicCLIP: a training framework that enhances logical sensitivity via logic-aware data generation (a data-generation sketch follows this list)
Contrastive learning combining coarse-grained alignment, a fine-grained multiple-choice objective, and a logical structure-aware objective
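
The abstract credits logic-aware data generation but does not describe the pipeline. Below is a hypothetical sketch of one way to mint logical hard negatives by perturbing connectives; the swap rules and relation labels are assumptions for illustration, not the paper's actual generation method.

```python
# Generate hard negatives that keep a caption's vocabulary but change its
# logical relation, forcing contrastive training to attend to connectives.
import re
from typing import Optional

# Each swap replaces a connective with one that alters the relation type.
CONNECTIVE_SWAPS = {
    r"\bbecause\b": "although",   # causality -> concession
    r"\bif\b": "unless",          # conditionality flipped
    r"\bbefore\b": "after",       # temporal order reversed
    r"\ball\b": "none of the",    # universal -> negated quantifier
}

def make_logic_negative(caption: str) -> Optional[str]:
    """Return a hard negative sharing surface semantics with `caption` but
    encoding a different logical relation, or None if no rule applies."""
    for pattern, replacement in CONNECTIVE_SWAPS.items():
        if re.search(pattern, caption, flags=re.IGNORECASE):
            return re.sub(pattern, replacement, caption, count=1,
                          flags=re.IGNORECASE)
    return None

print(make_logic_negative("The road is wet because it rained."))
# -> "The road is wet although it rained."
```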