Taxonomy-Aware Evaluation of Vision-Language Models

📅 2025-04-07
🤖 AI Summary
This paper addresses the evaluation challenge arising from the semantic mismatch between the open-ended textual outputs of vision-language models (VLMs) and hierarchical taxonomy labels. It proposes the first ontology-based, semantics-aware evaluation framework for VLMs. Methodologically, it introduces hierarchical precision and recall metrics that quantify partially correct predictions (for example, when a superclass is correctly identified but the fine-grained subclass is missed), and it uses taxonomy-aware path matching and embedding alignment to map generated text onto structured class hierarchies. Experiments on multiple fine-grained image classification benchmarks show that the framework exposes substantial differences in hierarchical consistency among state-of-the-art VLMs, which conventional text-similarity metrics fail to capture because they do not model taxonomic semantics. The framework thus provides interpretable, fine-grained diagnostic insights for VLM analysis and optimization.

📝 Abstract
When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
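
To make the idea of partial credit concrete, the following is a minimal sketch of hierarchical precision and recall, assuming the common ancestor-set formulation (each label is expanded to itself plus its ancestors, excluding the root, and the two sets are compared). The abstract does not spell out the exact definition used in the paper, so the formula, the toy taxonomy `PARENT`, and the function names are illustrative assumptions.

```python
# Toy taxonomy as child -> parent links (hypothetical; not taken from the paper).
PARENT = {
    "plant": None,
    "tree": "plant",
    "conifer": "tree",
    "norway spruce": "conifer",
    "scots pine": "conifer",
    "broadleaf": "tree",
    "silver birch": "broadleaf",
}


def ancestors(label):
    """Return the label plus all of its ancestors, excluding the root."""
    nodes = set()
    while label is not None and PARENT[label] is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes


def hierarchical_pr(predicted, truth):
    """Hierarchical precision and recall from ancestor-set overlap (assumed formulation)."""
    pred_set, true_set = ancestors(predicted), ancestors(truth)
    overlap = len(pred_set & true_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # A less-specific but correct answer: nothing predicted is wrong,
    # but one level of specificity is missing.
    print(hierarchical_pr("conifer", "norway spruce"))
    # A wrong leaf that shares part of the path with the ground truth.
    print(hierarchical_pr("scots pine", "norway spruce"))
```

Under this formulation, predicting 'conifer' for a 'norway spruce' image yields full precision and reduced recall, while a sibling species such as 'scots pine' is penalized on both measures.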

Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' unconstrained text predictions against taxonomies
Mapping generated text to hierarchical label spaces accurately
Measuring partial correctness of less-specific but valid answers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical precision and recall measures
Taxonomy mapping for VLM predictions
Fine-grained visual classification evaluation
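
The taxonomy-mapping step listed above can be illustrated with a deliberately naive baseline: scan the generated text for taxonomy labels and keep the deepest one mentioned. This is only a sketch under stated assumptions. The paper compares several mapping methods that are not described on this page, and the toy taxonomy `PARENT` and the helpers `depth` and `map_to_taxonomy` are hypothetical names introduced here.

```python
import re

# Hypothetical toy taxonomy as child -> parent links (not taken from the paper).
PARENT = {
    "plant": None,
    "tree": "plant",
    "conifer": "tree",
    "norway spruce": "conifer",
}


def depth(label):
    """Number of edges between a label and the root."""
    steps = 0
    while PARENT[label] is not None:
        label = PARENT[label]
        steps += 1
    return steps


def map_to_taxonomy(text):
    """Return the deepest taxonomy label mentioned in the text, or None."""
    lowered = text.lower()
    matches = [
        label
        for label in PARENT
        # Whole-word match so that a label is not found inside a longer word.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered)
    ]
    return max(matches, key=depth) if matches else None


if __name__ == "__main__":
    print(map_to_taxonomy("I see a conifer."))
    print(map_to_taxonomy("A tall tree, probably a Norway spruce."))
```

Exact string matching of this kind misses synonyms, plurals, and paraphrases, which is one reason a more robust mapping from free text to taxonomy nodes is needed in practice.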