Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of interpretable semantic hierarchies in vision-language model (VLM) embedding spaces and their misalignment with human ontologies. The authors propose a post-processing framework that constructs binary hierarchical trees via agglomerative clustering, annotates internal nodes using a concept lexicon, and introduces consistency metrics at both tree and edge levels to evaluate ontological plausibility. They further develop a lightweight, ontology-guided method to align embedding spaces and integrate an uncertainty-aware early-stopping mechanism to support interpretable reasoning. Experiments across 13 pretrained VLMs and 4 image datasets reveal that text encoders yield hierarchies more aligned with human ontologies, while image encoders exhibit greater discriminative power, highlighting a trade-off between zero-shot accuracy and ontological plausibility.
📝 Abstract
Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
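The hierarchy-extraction step described in the abstract (agglomerative clustering of class centroids, then naming internal nodes against a concept bank) can be sketched as below. This is a toy illustration under stated assumptions: a greedy average-linkage merge and a cosine-similarity naming rule stand in for the paper's exact clustering and dictionary-matching procedures, and the `lexicon` embeddings are hypothetical.

```python
import numpy as np

def agglomerative_tree(centroids, names):
    """Greedy average-linkage agglomerative clustering over class centroids.

    Returns the root of a binary tree; each node is a dict with 'name',
    'vector', 'size', and (for internal nodes) 'children'. A simplified
    stand-in for the paper's hierarchy-extraction step.
    """
    nodes = [{"name": n, "vector": np.asarray(v, float), "size": 1}
             for n, v in zip(names, centroids)]
    while len(nodes) > 1:
        # Find the closest pair of current clusters (Euclidean distance).
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = np.linalg.norm(nodes[i]["vector"] - nodes[j]["vector"])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        a, b = nodes[i], nodes[j]
        merged = {
            "name": None,  # filled in later by lexicon matching
            "vector": (a["size"] * a["vector"] + b["size"] * b["vector"])
                      / (a["size"] + b["size"]),
            "size": a["size"] + b["size"],
            "children": [a, b],
        }
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

def annotate(node, lexicon):
    """Name internal nodes via cosine similarity to a concept lexicon
    (dict: concept name -> embedding); an assumed matching rule, not
    necessarily the paper's dictionary-based one. Leaf names are kept."""
    if node.get("children"):
        sims = {c: float(np.dot(node["vector"], v)
                         / (np.linalg.norm(node["vector"]) * np.linalg.norm(v)))
                for c, v in lexicon.items()}
        node["name"] = max(sims, key=sims.get)
        for child in node["children"]:
            annotate(child, lexicon)
    return node
```

For example, with four 2-D class centroids where "cat"/"dog" and "car"/"truck" form two tight pairs, the tree first merges each pair and then joins them at the root, and `annotate` labels the two internal pair-nodes with the nearest concepts (e.g. "animal" and "vehicle") from the lexicon.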
Problem

Research questions and friction points this paper is trying to address.

semantic hierarchies
vision-language models
embedding space
ontological alignment
zero-shot classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic hierarchy
vision-language models
ontology alignment
post-hoc interpretability
embedding space transformation