Multimodal Large Language Models as Image Classifiers

πŸ“… 2026-03-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses the misestimation of multimodal large language model (MLLM) performance in image classification caused by inconsistent evaluation protocols and label noise. The authors systematically identify and rectify three critical flaws: output mapping errors, biased distractor design, and open-world assumption mismatches, culminating in ReGT, a high-quality relabeled dataset based on ImageNet-1k. Experiments show that the corrected evaluation boosts MLLM accuracy by up to 10.8%, substantially narrowing the gap with supervised models. Notably, human annotators adopted MLLM predictions on approximately 50% of challenging samples, underscoring their utility in annotation assistance. This work reveals that the perceived performance gap stems primarily from flawed evaluation practices rather than inherent model limitations, establishing a more reliable benchmark for future MLLM assessment.

πŸ“ Abstract
Multimodal Large Language Model (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices, such as batch size, image ordering, and text encoder selection, showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation. This work is part of the Aiming for Perfect ImageNet-1k project, see https://klarajanouskova.github.io/ImageNet/.
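One of the protocol flaws the abstract names is that free-form MLLM outputs falling outside the provided class list are simply discarded. A minimal sketch of the alternative, mapping the output to the most similar label instead of dropping it, is shown below; this is an illustrative assumption, not the paper's actual mapping procedure, and the class names and function are hypothetical:

```python
from difflib import SequenceMatcher

def map_to_class(prediction: str, class_names: list[str]) -> str:
    """Map a free-form MLLM answer to the closest label in the class list.

    Illustrative only: instead of discarding an answer that is not an
    exact class-name match, pick the most string-similar label.
    """
    pred = prediction.strip().lower()
    # Prefer an exact (case-insensitive) match.
    for name in class_names:
        if name.lower() == pred:
            return name
    # Otherwise fall back to the label with the highest similarity ratio.
    return max(
        class_names,
        key=lambda name: SequenceMatcher(None, pred, name.lower()).ratio(),
    )

# Hypothetical class subset for demonstration.
classes = ["tabby cat", "tiger cat", "Egyptian cat", "golden retriever"]
print(map_to_class("a tabby cat sitting on a mat", classes))
```

In practice an embedding-based similarity (e.g. a text encoder) would be more robust than string matching, which connects to the abstract's observation that text encoder selection itself measurably affects accuracy.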
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Image Classification
Evaluation Protocol
Ground Truth Quality
Annotation Noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
evaluation protocol
ground truth quality
ReGT reannotation
image classification
Nikita Kisel
Visual Recognition Group, Czech Technical University in Prague
Illia Volkov
Unknown affiliation
Klara Janouskova
Visual Recognition Group, Czech Technical University in Prague
Jiri Matas
Professor, Czech Technical University
computer vision, image processing, pattern recognition, machine learning