Evaluating the encoding competence of visual language models using uncommon actions

📅 2026-01-12

🤖 AI Summary
Current vision-language models struggle to recognize image–text pairs that are syntactically well-formed but violate common sense, revealing a lack of deep reasoning about semantic plausibility and physical feasibility. To address this limitation, this work introduces the UAIT dataset, which focuses specifically on unconventional action scenarios. Using large language models, few-shot prompting, and text-to-image generation, the authors semi-automatically construct high-quality image–text multiple-choice questions for evaluating fine-grained semantic reasoning. Experimental results show that state-of-the-art models perform substantially worse than humans on UAIT, while lightweight models gain accuracy after targeted fine-tuning. These findings support the value of domain-specific adaptation and position UAIT as a benchmark for diagnosing and improving model robustness in complex semantic reasoning tasks.

📝 Abstract
We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets, which focus on common visual scenes that benefit from statistical frequency, UAIT challenges models with image-text pairs that are grammatically well-formed but run counter to common sense. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated pipeline that synthesizes high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question that tests the model's fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic plausibility. Further experiments show that even a lightweight model improves its accuracy after fine-tuning, demonstrating the potential of targeted adaptation. This study not only reveals key weaknesses of VLMs but also provides diagnostic tools and research directions for developing robust models with genuine visual semantic reasoning capabilities.
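
The multiple-choice evaluation described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `MultipleChoiceItem` fields, the file names, the example captions, and the `always_first` baseline are hypothetical and do not come from the UAIT release, whose actual schema and scoring script are not described here. The sketch only shows how accuracy over items whose options swap agent and patient might be computed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical record layout; not the official UAIT schema."""
    image_path: str            # image showing the uncommon action
    options: Tuple[str, ...]   # candidate captions, e.g. with agent and patient swapped
    answer_index: int          # index of the caption that matches the image


def accuracy(items: Sequence[MultipleChoiceItem],
             choose: Callable[[MultipleChoiceItem], int]) -> float:
    """Fraction of items for which `choose` returns the correct option index."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if choose(item) == item.answer_index)
    return correct / len(items)


# Invented examples: each image depicts the uncommon action, and the
# distractor is the statistically more frequent caption.
items = [
    MultipleChoiceItem("img_0001.png",
                       ("A man bites a dog.", "A dog bites a man."), 0),
    MultipleChoiceItem("img_0002.png",
                       ("A cat chases a mouse.", "A mouse chases a cat."), 1),
]


def always_first(item: MultipleChoiceItem) -> int:
    """Trivial baseline that ignores the image and picks option 0."""
    return 0


print(accuracy(items, always_first))
```

In practice, `choose` would wrap a real model call, for instance picking the caption with the highest image-text similarity for a contrastive model or parsing the selected letter from a generative VLM's answer. The baseline here exists only to make the scoring loop runnable.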

Problem

Research questions and friction points this paper is trying to address.

visual language models
semantic understanding
uncommon-sense actions
image-text reasoning
agent-patient relationships

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual language models
uncommon-sense reasoning
semantic understanding
image-text benchmark
fine-grained evaluation