AI Summary
This paper addresses the problem of evaluating commonsense consistency in image-text pairs (e.g., "a boy holding a vacuum cleaner in a desert"), a critical yet underexplored challenge in vision-language understanding. Methodologically, it introduces the first atomic-fact-based evaluation framework: fine-grained atomic facts are extracted from image-text inputs using large vision-language models (LVLMs), encoded with a Transformer, and then classified for inter-fact consistency by a lightweight, differentiable attention-based pooling classifier. Key contributions include: (1) establishing an atomic-fact-driven paradigm for commonsense modeling; and (2) proposing TLG, a parameter-efficient, cross-domain generalizable classification architecture. Evaluated on the WHOOPS! and WEIRD benchmarks, the method achieves new state-of-the-art accuracy, with significant average improvements over prior work, while reducing model parameters by over 40% compared to existing approaches.
Abstract
Measuring how realistic an image looks is a complex task in artificial intelligence research. For example, an image of a boy holding a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image commonsense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate and potentially erroneous facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while relying on a compact fine-tuned component.
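The abstract's final classification step can be sketched as follows. This is a minimal, hypothetical NumPy illustration of attention pooling over encoded atomic facts, not the paper's actual implementation: the function name, weight shapes, and two-class output are assumptions made for the example; in TLG the query and classifier weights would be the learned, fine-tuned parameters.

```python
import numpy as np

def attention_pool_classify(facts, query, w, b):
    """Pool a variable-length set of fact embeddings with a learned
    attention query, then score the pooled vector with a linear head.

    facts: (n_facts, dim) Transformer-encoded atomic facts
    query: (dim,) learned pooling query vector
    w, b:  (dim, n_classes), (n_classes,) linear classifier weights
    """
    # Scaled dot-product scores between each fact and the pooling query.
    scores = facts @ query / np.sqrt(facts.shape[-1])   # (n_facts,)
    # Softmax over facts (shift by max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Attention-weighted sum of fact embeddings.
    pooled = weights @ facts                            # (dim,)
    # Consistency logits (e.g., commonsense-consistent vs. weird).
    return pooled @ w + b                               # (n_classes,)
```

Because the pooling is a softmax-weighted sum, the whole head is differentiable and adds only `dim + dim * n_classes + n_classes` parameters on top of the frozen encoder, which is consistent with the "compact fine-tuning component" described above.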