Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
State-of-the-art large-scale vision-language models (e.g., CLIP) achieve strong zero-shot learning (ZSL) performance but lack interpretability due to their reliance on global image–class embedding matching. Method: We propose LaZSL, a training-free, annotation-free interpretable ZSL framework that leverages optimal transport to establish fine-grained alignment between local image regions and discrete semantic attributes—enabling the construction of inherently interpretable classifiers. Built upon off-the-shelf pre-trained vision-language models, LaZSL automatically discovers visual–semantic correspondences without supervision. Contribution/Results: LaZSL significantly improves classification accuracy and cross-domain generalization across multiple benchmarks. Moreover, it supports intuitive, pixel-level attribution visualization, effectively reconciling high performance with model transparency.

📝 Abstract
Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Code available at: https://github.com/shiming-chen/LaZSL.
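The core mechanism the abstract describes, aligning local image regions to attribute text via optimal transport, can be sketched as an entropic (Sinkhorn) OT problem over region/attribute cosine similarities. This is a minimal numpy illustration, not the paper's implementation: the uniform marginals, the cost `1 - similarity`, and the hyperparameters are illustrative assumptions, and the random vectors stand in for real CLIP region and attribute embeddings.

```python
import numpy as np

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropic-regularized OT plan for a cost matrix with uniform marginals."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform region/attribute mass
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)      # transport plan T (n x m)

# Toy stand-ins for unit-normalized region and attribute embeddings.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
attrs = rng.normal(size=(3, 8))
attrs /= np.linalg.norm(attrs, axis=1, keepdims=True)

sim = regions @ attrs.T          # region-attribute cosine similarities
T = sinkhorn(1.0 - sim)          # cheap regions<->attributes transport
score = float((T * sim).sum())   # OT-weighted image-class similarity
```

Because `T` is a soft matching, its entries say which region supports which attribute, which is what makes the resulting similarity score inspectable rather than a single opaque global dot product.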
Problem

Research questions and friction points this paper is trying to address.

Lack of interpretability in vision-language models for zero-shot learning
Difficulty aligning local visual features with discrete attributes
Need for effective alignment without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local visual-semantic alignment via optimal transport
Interpretable zero-shot learning without additional training
Aligns visual regions with corresponding attributes
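To show how region-attribute alignment yields an interpretable classifier, here is a deliberately simplified sketch: each class is scored by how well its attributes are matched by some image region, and the per-attribute scores double as the explanation. The max-pooling aggregation below is a hypothetical stand-in for the paper's OT-based score, and all names and embeddings are illustrative.

```python
import numpy as np

def predict(region_feats, class_attr_feats):
    """Score each class by its per-attribute best-region similarity;
    return the winning class plus the attribute-level evidence."""
    scores = {}
    for cls, attrs in class_attr_feats.items():
        sim = region_feats @ attrs.T        # regions x attributes
        per_attr = sim.max(axis=0)          # best-matching region per attribute
        scores[cls] = float(per_attr.mean())
    return max(scores, key=scores.get), scores

# Toy unit-normalized features for 4 regions and two 3-attribute classes.
rng = np.random.default_rng(1)
regions = rng.normal(size=(4, 8))
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
classes = {c: rng.normal(size=(3, 8)) for c in ("sparrow", "cardinal")}
for c in classes:
    classes[c] /= np.linalg.norm(classes[c], axis=1, keepdims=True)

label, scores = predict(regions, classes)
```

A prediction can then be explained as "class X because regions r matched attributes a", which is the training-free interpretability the bullets above describe.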