Image Recognition with Vision and Language Embeddings of VLMs

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the complementary roles of the visual and language-guided pathways in dual-encoder vision-language models (VLMs) for zero-shot image classification. Focusing on ImageNet-1k and its label-corrected variant, we systematically evaluate visual embeddings and textual prompts from models including SigLIP 2 and RADIOv2.5, analysing how prompt design, class diversity, the number of k-NN neighbours, and reference set size affect accuracy. We propose a training-free, per-class precision-weighted fusion strategy that combines language-embedding similarities with the predictions of a k-nearest-neighbours classifier over visual embeddings. Our method avoids end-to-end fine-tuning and exploits modality-specific strengths without architectural modification: some classes are better handled by textual prompts, others by visual similarity. Experiments show consistent gains over unimodal baselines, providing empirical evidence of complementary dual-pathway behaviour in dual-encoder VLMs and demonstrating the efficacy of lightweight, precision-aware embedding fusion for zero-shot recognition.
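As a rough illustration of the fusion idea described above, the sketch below combines per-class scores from the language and vision pathways, weighting each class by the pathway's held-out precision. The function names, the convex-weighting scheme, and the score shapes are assumptions for illustration, not the authors' implementation (see their repository for the actual code).

```python
import numpy as np

def fuse_scores(lang_scores, vis_scores, lang_prec, vis_prec):
    """Combine per-class scores from the two pathways.

    lang_scores, vis_scores: (n_images, n_classes) score matrices from the
                             language and vision (k-NN) pathways.
    lang_prec, vis_prec:     (n_classes,) per-class precision of each pathway,
                             estimated on a held-out reference set.
    """
    # Normalise the two precisions into per-class convex weights.
    total = lang_prec + vis_prec + 1e-12
    w_lang = lang_prec / total           # (n_classes,)
    w_vis = vis_prec / total
    # Broadcasting applies each class's weight down its score column.
    return w_lang * lang_scores + w_vis * vis_scores

# Prediction: argmax over the fused per-class scores.
# preds = fuse_scores(L, V, p_lang, p_vis).argmax(axis=1)
```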

📝 Abstract
Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
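For readers unfamiliar with the language-guided pathway the abstract refers to, here is a minimal sketch of prompt-based zero-shot classification with a dual-encoder VLM. It assumes image and text embeddings are precomputed and L2-normalised; the prompt templates are illustrative placeholders, not the paper's prompt set.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """image_emb: (d,) unit-norm image embedding.
    class_text_embs: (n_classes, d) unit-norm text embeddings, one per class.

    Returns the index of the class whose prompt embedding has the highest
    cosine similarity with the image.
    """
    sims = class_text_embs @ image_emb   # cosine similarity for unit vectors
    return int(np.argmax(sims))

# Prompt ensembling (a common variant): embed several templates per class,
# e.g. "a photo of a {c}" and "a close-up photo of a {c}", average the
# resulting text embeddings, and re-normalise before comparison.
```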
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-only and language-guided image classification with VLMs
Analyzing key factors affecting VLM accuracy in zero-shot classification
Introducing a fusion method to combine vision and language strengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-encoder vision-language models for zero-shot classification
Vision-only classification via k-NN over visual embeddings (see the sketch after this list)
Learning-free fusion method based on per-class precision
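The vision-only pathway classifies a query image by nearest-neighbour search over a labelled reference set of image embeddings. The sketch below uses scikit-learn with cosine distance; this library choice and the default k are assumptions for illustration, not the paper's exact setup.

```python
from sklearn.neighbors import KNeighborsClassifier

def build_knn(ref_embs, ref_labels, k=10):
    """ref_embs: (n_ref, d) reference image embeddings.
    ref_labels: (n_ref,) class ids for the reference set.
    """
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(ref_embs, ref_labels)
    return knn

# Vision-only predictions for a batch of query embeddings:
# preds = build_knn(reference_embeddings, reference_labels).predict(query_embs)
```

The abstract notes that both k (the number of neighbours) and the reference set size are key factors affecting accuracy, which is exactly what this classifier's two inputs control.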
Illia Volkov
Unknown affiliation

Nikita Kisel
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czechia

Klara Janouskova
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czechia

Jiri Matas
Professor, Czech Technical University
computer vision, image processing, pattern recognition, machine learning