What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

📅 2024-05-24
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitations of handcrafted prompt engineering and poor generalization in zero-shot image classification, this paper proposes a prompt-free unified framework. It uses multimodal large language models (MLLMs) to automatically generate descriptive textual captions for input images and maps these captions into fixed-dimensional features in a cross-modal embedding space. The caption embeddings are then fused with image features and fed into a lightweight linear classifier for zero-shot recognition. The key contribution is exploiting the generative capacity of MLLMs to construct discriminative, text-driven representations, eliminating dataset-specific prompt design: a single, straightforward set of prompts is shared across all datasets. Evaluated on ten standard benchmarks, the method achieves an average accuracy gain of 6.2 percentage points over prior methods re-evaluated in the same setup, including a 6.8-point improvement on ImageNet.

📝 Abstract
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs. Using multimodal LLMs, we generate comprehensive textual representations from input images. These textual representations are then used to produce fixed-dimensional features in a cross-modal embedding space. Subsequently, these features are fused to perform zero-shot classification with a linear classifier. Our method does not require prompt engineering for each dataset; instead, we use a single, straightforward set of prompts across all datasets. We evaluated our method on several datasets, and the results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets. On average, across ten benchmarks, our method achieved an accuracy gain of 6.2 percentage points, with an increase of 6.8 percentage points on the ImageNet dataset, compared to prior methods re-evaluated with the same setup. Our findings highlight the potential of multimodal LLMs to enhance computer vision tasks such as zero-shot image classification, offering a significant improvement over traditional methods.
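As a rough illustration (not the authors' code), the fuse-and-classify step described in the abstract might look like the sketch below. It assumes precomputed image, caption, and class-name embeddings already living in a shared cross-modal space (e.g., from CLIP-style encoders); the function names, the weighted-average fusion, and the cosine-similarity scoring in place of a trained linear classifier are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_and_classify(image_emb, caption_emb, class_embs, alpha=0.5):
    """Hypothetical fusion step: blend the image embedding with the
    embedding of the MLLM-generated caption, then score the fused
    feature against class-name embeddings by cosine similarity
    (a stand-in for the paper's linear classifier).

    image_emb:   (d,) image feature from the cross-modal space
    caption_emb: (d,) embedding of the generated caption
    class_embs:  (n_classes, d) class-name text embeddings
    alpha:       fusion weight between image and caption features
    """
    fused = alpha * l2_normalize(image_emb) + (1 - alpha) * l2_normalize(caption_emb)
    fused = l2_normalize(fused)
    scores = l2_normalize(class_embs) @ fused  # cosine similarities
    return int(np.argmax(scores))              # predicted class index
```

For example, with a caption embedding that points in roughly the same direction as the image embedding, the fused feature is scored against each class-name embedding and the highest-similarity class is returned; in a real pipeline the embeddings would come from the MLLM and a cross-modal encoder rather than being constructed by hand.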
Problem

Research questions and friction points this paper is trying to address.

Enhancing zero-shot image classification using multimodal LLMs
Generating textual representations from images for classification
Improving accuracy without dataset-specific prompt engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs generate textual image representations.
Cross-modal embedding space creates fixed-dimensional features.
Single prompt set simplifies zero-shot classification process.