Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

📅 2025-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vocabulary-free fine-grained visual recognition (VF-FGVR) aims to predict precise, open-vocabulary labels directly in an unrestricted semantic space without predefined category vocabularies—yet remains challenging under label-scarce conditions. Method: This paper introduces NeaR, the first approach to leverage multimodal large language models (MLLMs) to generate noisy open-vocabulary labels for constructing weakly supervised datasets. NeaR integrates CLIP fine-tuning, nearest-neighbor label refinement, and noise-robust training to enable efficient vocabulary-free recognition. Contribution/Results: Evaluated on a newly established VF-FGVR benchmark, NeaR achieves substantial accuracy gains over prior methods. It accelerates inference by over 100× compared to direct MLLM invocation and reduces API costs by more than 90%. By eliminating reliance on fixed taxonomies and expensive oracle annotations, NeaR establishes a scalable, low-cost paradigm for fine-grained recognition—particularly beneficial for resource-constrained domains such as medical imaging.

📝 Abstract
Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce Nearest-Neighbor Label Refinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.
Problem

Research questions and friction points this paper is trying to address.

Addressing fine-grained visual recognition without predefined labels
Overcoming lack of expert-annotated datasets in sensitive domains
Reducing costs and inference times of multimodal LLMs for recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages MLLMs for label generation
Fine-tunes CLIP with noisy MLLM labels
Efficient nearest-neighbor refinement of noisy labels in embedding space
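The core refinement idea can be illustrated with a minimal sketch: given image embeddings (e.g., from a CLIP encoder) and noisy MLLM-generated labels, each image's label is smoothed by a majority vote over its nearest neighbors in embedding space. This is a simplified illustration, not the paper's exact algorithm; the function name, the plain k-NN majority vote, and the random toy data are assumptions for demonstration.

```python
import numpy as np
from collections import Counter

def refine_labels(embeddings, noisy_labels, k=3):
    """Replace each image's noisy label with the majority label among its
    k nearest neighbors in embedding space (illustrative sketch only)."""
    # Cosine similarity between all pairs of L2-normalized embeddings
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    refined = []
    for i in range(len(noisy_labels)):
        neighbors = np.argsort(sims[i])[::-1][:k]  # top-k most similar
        votes = Counter(noisy_labels[j] for j in neighbors)
        votes[noisy_labels[i]] += 1  # the image's own label also votes
        refined.append(votes.most_common(1)[0][0])
    return refined

# Toy example: two tight clusters, one point mislabeled
rng = np.random.default_rng(0)
cluster_a = rng.normal([1.0, 0.0], 0.01, size=(4, 2))
cluster_b = rng.normal([0.0, 1.0], 0.01, size=(4, 2))
embeddings = np.vstack([cluster_a, cluster_b])
noisy = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]
print(refine_labels(embeddings, noisy))  # the stray "dog" in cluster A is corrected
```

A full implementation would also handle open-ended label strings (e.g., merging near-duplicate MLLM outputs) and feed the refined labels into noise-robust CLIP fine-tuning, as described in the summary above.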