🤖 AI Summary
Large multimodal models (LMMs) underperform significantly compared to lightweight few-shot learning (FSL) expert models on fine-grained visual species recognition (VSR), exposing limitations of their end-to-end prediction paradigm. This work is the first to reveal that LMMs possess a strong post-hoc correction capability in VSR. Leveraging this insight, we propose Post-hoc Correction (POC), a training-free, plug-and-play framework that refines FSL model predictions via LMM-guided re-ranking. POC enriches its prompts with few-shot visual exemplars and softmax confidence scores, and requires no additional training or human intervention. It is compatible with diverse pre-trained backbones and LMMs. Evaluated across five fine-grained species recognition benchmarks, POC improves accuracy by 6.4% on average, substantially enhancing both the practical utility and generalization capability of LMMs in domain-specific visual recognition tasks.
📝 Abstract
Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, so in practice experts can annotate only a few examples. These limited labeled data motivate training an "expert" model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is natural to ask whether LMMs excel at the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task despite various established prompting techniques. LMMs even significantly underperform FSL expert models, which are as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively correct the expert models' incorrect predictions post hoc: given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model's top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms the prior state of the art in FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes across different pretrained backbones and LMMs, serving as a plug-and-play module that significantly enhances existing FSL methods.
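The re-ranking step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`build_poc_prompt`, `poc_correct`), the prompt wording, and the `query_lmm` callable standing in for a real multimodal API are all hypothetical.

```python
# Hypothetical sketch of Post-hoc Correction (POC): the FSL expert model's
# top-k predictions, with their softmax confidences, are handed to an LMM
# for re-ranking. Prompt wording and helper names are illustrative only.

def build_poc_prompt(class_names, probs, k=5):
    """Select the expert model's top-k classes by softmax confidence and
    format a re-ranking prompt. Returns (prompt, candidate_names)."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    lines = [f"{rank + 1}. {class_names[i]} (confidence {probs[i]:.2f})"
             for rank, i in enumerate(top)]
    prompt = (
        "An expert classifier produced these candidate species for the "
        "attached test image, with softmax confidences:\n"
        + "\n".join(lines)
        + "\nFew-shot reference images for each candidate are also attached. "
          "Answer with the single most likely species name."
    )
    return prompt, [class_names[i] for i in top]

def poc_correct(class_names, probs, query_lmm, k=5):
    """Ask the LMM to re-rank the candidates; if its reply is not one of
    them, fall back to the expert model's top-1 prediction."""
    prompt, candidates = build_poc_prompt(class_names, probs, k)
    answer = query_lmm(prompt).strip()
    return answer if answer in candidates else candidates[0]
```

In a real pipeline, `query_lmm` would send the prompt together with the test image and the few-shot exemplar images to an LMM; the fallback keeps POC safe to bolt onto any FSL model, since an unparseable LMM reply leaves the expert's original prediction unchanged.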