Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the zero-shot and few-shot capabilities of multimodal large language models (MLLMs) and vision-language models (including GPT-4o, Gemini 2.5 Pro, CLIP, and SigLIP) for single-label classification of Christian iconography, aiming to assess their feasibility as drop-in replacements for supervised models (e.g., ResNet50) in digital humanities metadata annotation without fine-tuning. Method: benchmarking across the ArtDL, ICONCLASS, and Wikidata datasets, integrating Iconclass semantic class descriptions and minimal exemplars, with further analysis of prompt engineering and context augmentation. Contribution/Results: Gemini 2.5 Pro and GPT-4o achieve zero-shot accuracy surpassing fine-tuned ResNet50 baselines, though accuracy drops sharply on Wikidata, where SigLIP scores highest. Incorporating semantic class descriptions markedly improves zero-shot performance, whereas few-shot prompting generally underperforms, yielding only occasional, marginal gains. The work establishes general-purpose MLLMs as a viable and practically promising paradigm for cultural image understanding, enabling label-efficient annotation in heritage domains without task-specific adaptation.

📝 Abstract
This study evaluates the capabilities of Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) on the task of single-label classification of Christian iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs such as GPT-4o and Gemini 2.5 can interpret iconography, a task typically addressed by supervised classifiers, and to evaluate their performance. Two research questions guided the analysis: (RQ1) how do multimodal LLMs perform on image classification of Christian saints? and (RQ2) how does performance vary when the input is enriched with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets that support Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini 2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where SigLIP reached the highest accuracy, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal gains in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the use of LLMs as metadata curation tools in digital humanities workflows and suggest future research on prompt optimization and the extension of the study to other classification strategies and models.
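The three prompting conditions in the abstract (bare class labels, labels enriched with Iconclass descriptions, and few-shot with exemplars) can be sketched as a prompt-construction helper. This is a minimal illustration, not the authors' actual prompts; the class names, descriptions, and exemplar references are hypothetical placeholders:

```python
def build_prompt(classes, descriptions=None, exemplars=None):
    """Assemble a single-label classification prompt under one of the three
    benchmark conditions: labels only, labels plus Iconclass descriptions,
    or few-shot with labelled exemplars (all names here are illustrative)."""
    lines = ["Classify the artwork into exactly one of these classes:"]
    for c in classes:
        if descriptions:  # condition 2: enrich with Iconclass descriptions
            lines.append(f"- {c}: {descriptions[c]}")
        else:             # condition 1: bare class labels
            lines.append(f"- {c}")
    if exemplars:         # condition 3: few-shot, e.g. five labelled examples
        lines.append("Examples:")
        for image_id, label in exemplars:
            lines.append(f"- image {image_id} -> {label}")
    lines.append("Answer with the class label only.")
    return "\n".join(lines)
```

In a real pipeline the resulting text would accompany the image in a multimodal API call; here only the text-assembly logic is shown.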
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal LLMs on Christian iconography classification tasks
Assessing zero-shot versus few-shot performance with contextual information
Comparing general-purpose models against specialized supervised classifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated multimodal LLMs on Christian iconography classification
Tested zero-shot and few-shot learning with contextual prompts
Compared performance against fine-tuned ResNet50 baselines
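The dataset preparation step mentioned in the abstract (restricting each dataset to its 10 most frequent classes) can be sketched as follows; this is an assumed reconstruction in which records are simple (image_id, label) pairs rather than the paper's actual data format:

```python
from collections import Counter

def filter_top_k(records, k=10):
    """Keep only records whose label is among the k most frequent classes.
    `records` is a list of (image_id, iconclass_label) pairs (illustrative)."""
    counts = Counter(label for _, label in records)
    top = {label for label, _ in counts.most_common(k)}
    return [r for r in records if r[1] in top]
```

The same filtered subset would then be used both for prompting the MLLMs and for fine-tuning the ResNet50 baseline, so that accuracies are comparable.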
Gianmarco Spinaci
Department of Classical Philology and Italian Studies, University of Bologna, Italy; Villa i Tatti, The Harvard University Center for Italian Renaissance Studies, Florence, Italy
Lukas Klic
Villa i Tatti, The Harvard University Center for Italian Renaissance Studies, Florence, Italy
Giovanni Colavizza
University of Copenhagen and University of Bologna
Digital Humanities · Data Science · Artificial Intelligence