Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images

πŸ“… 2025-05-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses the problem of evaluating commonsense consistency in image-text pairs (e.g., β€œa boy holding a vacuum cleaner in a desert”)β€”a critical yet underexplored challenge in vision-language understanding. Methodologically, it introduces the first atomic-fact-based evaluation framework: fine-grained atomic facts are extracted from image-text inputs using large vision-language models (LVLMs), encoded via Transformers, and then classified for inter-fact consistency using a lightweight, differentiable attention-based pooling classifier. Key contributions include: (1) establishing an atomic-fact-driven paradigm for commonsense modeling; and (2) proposing TLGβ€”a parameter-efficient, cross-domain generalizable classification architecture. Evaluated on the WHOOPS! and WEIRD benchmarks, the method achieves new state-of-the-art accuracy, with significant average improvements over prior work, while reducing model parameters by over 40% compared to existing approaches.
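The paper's code is not reproduced here, but the classifier the summary describes, a lightweight attention-based pooling head over Transformer-encoded atomic facts, can be sketched roughly as follows. This is an illustrative implementation under assumed details: the embedding dimension, the binary consistency label, and all class/variable names are assumptions, not the authors' actual TLG code.

```python
import torch
import torch.nn as nn

class AttentionPoolingClassifier(nn.Module):
    """Pools a variable-length set of fact embeddings into a single
    vector via learned attention, then classifies consistency."""

    def __init__(self, dim: int = 384, num_classes: int = 2):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # one attention score per fact
        self.head = nn.Linear(dim, num_classes)  # consistency classifier

    def forward(self, facts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # facts: (batch, n_facts, dim) encoded atomic facts
        # mask:  (batch, n_facts), 1 = real fact, 0 = padding
        scores = self.score(facts).squeeze(-1)             # (batch, n_facts)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = scores.softmax(dim=-1).unsqueeze(-1)     # (batch, n_facts, 1)
        pooled = (weights * facts).sum(dim=1)              # (batch, dim)
        return self.head(pooled)                           # (batch, num_classes)

# Example: 2 images, up to 5 atomic facts each, 384-dim sentence embeddings
model = AttentionPoolingClassifier(dim=384)
facts = torch.randn(2, 5, 384)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
logits = model(facts, mask)
print(logits.shape)  # torch.Size([2, 2])
```

Because only the small pooling head is trained while the LVLM and encoder stay frozen, a design like this keeps the fine-tuned parameter count low, consistent with the parameter-efficiency claim above.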

πŸ“ Abstract
Measuring how realistic an image looks is a complex task in artificial intelligence research. For example, an image of a boy holding a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common sense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while relying only on a compact fine-tuning component.
Problem

Research questions and friction points this paper is trying to address.

Evaluating common sense consistency in weird images
Assessing image realism using LVLMs and Transformers
Improving performance on WHOOPS! and WEIRD datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Large Vision-Language Models for fact extraction
Employs Transformer-based encoder for processing
Fine-tunes compact attention-pooling classifier