CIVET: Systematic Evaluation of Understanding in VLMs

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack a standardized, interpretable evaluation of their understanding of object attributes and relations. To address this, the authors propose CIVET, a controlled evaluation framework built on synthetic image stimuli. CIVET combines a structured attribute-and-relation testing protocol, cross-model consistency analysis, and human-AI comparative experiments, enabling hypothesis-driven, statistically rigorous assessment while avoiding annotation noise, dataset-specific bias, and uncontrolled scene complexity. Experiments show that state-of-the-art VLMs accurately recognize only a limited set of basic attributes, degrade significantly when object position varies, and comprehend fundamental spatial and semantic relations poorly, falling substantially short of human-level accuracy. CIVET thus provides a controlled, reproducible benchmark for probing semantic understanding in VLMs.

📝 Abstract
While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs' understanding of object properties and relations
Addresses lack of standardized systematic evaluation for VLMs
Compares VLM performance with human-level accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CIVET for systematic VLM evaluation
Uses controlled stimuli to test object properties
Reveals VLMs' limitations in understanding relations
Massimo Rizzoli
Signals and Interactive Systems Lab, University of Trento, Italy
Simone Alghisi
Signals and Interactive Systems Lab, University of Trento, Italy
Olha Khomyn
Signals and Interactive Systems Lab, University of Trento, Italy
Gabriel Roccabruna
PhD Student, Signals and Interactive Systems Lab (SISLab), University of Trento, Italy
NLP · Dialogue Systems · Machine Learning · Deep Learning
Seyed Mahed Mousavi
Signals and Interactive Systems Lab, University of Trento, Italy
Giuseppe Riccardi
Professor of Computer Science, University of Trento, Italy
Natural Language Processing · Speech Processing · Dialogue · Machine Learning