🤖 AI Summary
To address the challenge that Wikipedia editors, who often lack domain expertise, struggle to select appropriate images and to distinguish visually similar illustrations, this paper proposes QuizRank, an image-ranking framework based on multiple-choice quizzes. Methodologically, it uses large language models to turn textual descriptions of a concept into questions about its salient visual features, then quizzes a vision-language model on each candidate image; images are ranked by how well they help the model answer. A contrastive variant further generates questions from the feature differences between the target concept and semantically or visually confounding distractors, markedly improving discrimination of fine-grained visual differences (e.g., between highly similar species). Experiments show strong agreement between QuizRank's rankings and human judgments (Spearman's ρ > 0.85) and consistent gains over existing unsupervised image-assessment methods across multiple image–text matching benchmarks, effectively enabling non-expert editors to efficiently identify highly explanatory illustrations.
📝 Abstract
Images play a vital role in improving the readability and comprehension of Wikipedia articles by serving as 'illustrative aids.' However, not all images are equally effective, and not all Wikipedia editors are trained in their selection. We propose QuizRank, a novel method of image selection that leverages large language models (LLMs) and vision-language models (VLMs) to rank images as learning interventions. Our approach transforms textual descriptions of the article's subject into multiple-choice questions about important visual characteristics of the concept. We use these questions to quiz the VLM: the better an image helps answer the questions, the higher it is ranked. To further improve discrimination between visually similar items, we introduce Contrastive QuizRank, which leverages differences in the features of target (e.g., a Western Bluebird) and distractor (e.g., a Mountain Bluebird) concepts to generate questions. We demonstrate the potential of VLMs as effective visual evaluators by showing high congruence with human quiz-takers and an effective discriminative ranking of images.
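The quiz-and-rank loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_vlm` is a hypothetical stub standing in for a real VLM query, and representing an image as a set of visible features is purely for demonstration.

```python
from dataclasses import dataclass

@dataclass
class QuizQuestion:
    prompt: str
    choices: list
    answer: str

def ask_vlm(image, question):
    """Hypothetical stand-in for a real VLM call (a real system would send
    the image and question text to a vision-language model). Here an image
    is a dict with a 'features' set, and the stub answers correctly only
    when the image actually depicts the feature the question probes."""
    if question.answer in image["features"]:
        return question.answer
    return question.choices[0]  # an arbitrary wrong guess

def quizrank(images, questions):
    """Rank images by quiz accuracy: the more questions an image helps the
    VLM answer correctly, the higher it ranks."""
    def accuracy(image):
        correct = sum(ask_vlm(image, q) == q.answer for q in questions)
        return correct / len(questions)
    return sorted(images, key=accuracy, reverse=True)

# Toy data: questions one might derive from a Western Bluebird description.
questions = [
    QuizQuestion("What color is the bird's throat?", ["gray", "blue"], "blue"),
    QuizQuestion("What color is the bird's breast?", ["white", "orange"], "orange"),
]
images = [
    {"name": "distant_shot", "features": {"blue"}},            # only throat visible
    {"name": "clear_profile", "features": {"blue", "orange"}}, # both features visible
]
ranked = quizrank(images, questions)  # clear_profile ranks first (accuracy 1.0 vs 0.5)
```

The contrastive variant would change only the question set: questions generated from features that differ between target and distractor (e.g., the Mountain Bluebird's all-blue breast vs. the Western Bluebird's orange breast), so that images showing those distinguishing features score higher.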