AI Summary
This paper addresses the problem of evaluating commonsense consistency in image-text pairs (e.g., "a boy holding a vacuum cleaner in a desert"), a critical yet underexplored challenge in vision-language understanding. Methodologically, it introduces the first atomic-fact-based evaluation framework: fine-grained atomic facts are extracted from image-text inputs using large vision-language models (LVLMs), encoded with a Transformer, and then classified for inter-fact consistency by a lightweight, differentiable attention-based pooling classifier. Key contributions include: (1) establishing an atomic-fact-driven paradigm for commonsense modeling; and (2) proposing TLG, a parameter-efficient, cross-domain generalizable classification architecture. Evaluated on the WHOOPS! and WEIRD benchmarks, the method achieves new state-of-the-art accuracy, with significant average improvements over prior work, while reducing model parameters by over 40% compared to existing approaches.
Abstract
Measuring how realistic an image looks is a complex task in artificial intelligence research. For example, an image of a boy holding a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image commonsense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate and potentially erroneous facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while relying on a compact fine-tuned component.
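The abstract's final classification step can be sketched as follows. This is a minimal, hypothetical NumPy illustration of attention pooling over encoded atomic facts, not the paper's actual implementation: the function name, weight shapes, and two-class output are assumptions made for the example; in TLG the query and classifier weights would be the learned, fine-tuned parameters.

```python
import numpy as np

def attention_pool_classify(facts, query, w, b):
    """Pool a variable-length set of fact embeddings with a learned
    attention query, then score the pooled vector with a linear head.

    facts: (n_facts, dim) Transformer-encoded atomic facts
    query: (dim,) learned pooling query vector
    w, b:  (dim, n_classes), (n_classes,) linear classifier weights
    """
    # Scaled dot-product scores between each fact and the pooling query.
    scores = facts @ query / np.sqrt(facts.shape[-1])   # (n_facts,)
    # Softmax over facts (shift by max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Attention-weighted sum of fact embeddings.
    pooled = weights @ facts                            # (dim,)
    # Consistency logits (e.g., commonsense-consistent vs. weird).
    return pooled @ w + b                               # (n_classes,)
```

Because the pooling is a softmax-weighted sum, the whole head is differentiable and adds only `dim + dim * n_classes + n_classes` parameters on top of the frozen encoder, which is consistent with the "compact fine-tuning component" described above.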