🤖 AI Summary
Vocabulary-free fine-grained visual recognition (VF-FGVR) aims to predict precise, open-vocabulary labels directly in an unrestricted semantic space without predefined category vocabularies—yet remains challenging under label-scarce conditions.
Method: This paper introduces NeaR, the first approach to leverage multimodal large language models (MLLMs) to generate noisy open-vocabulary labels for constructing weakly supervised datasets. NeaR integrates CLIP fine-tuning, nearest-neighbor label refinement, and noise-robust training to enable efficient vocabulary-free recognition.
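The nearest-neighbor refinement step can be sketched as follows. This is an illustrative sketch, not the paper's exact procedure: it assumes image features live in a CLIP-like embedding space and refines each image's noisy MLLM-generated label by a majority vote over the labels of its k nearest neighbors under cosine similarity (the function name `refine_labels` and the voting scheme are assumptions for illustration).

```python
import numpy as np
from collections import Counter

def refine_labels(embeddings, noisy_labels, k=3):
    """Refine noisy per-image labels by majority vote over each image's
    k nearest neighbors in embedding space (cosine similarity).
    Hypothetical sketch of nearest-neighbor label refinement."""
    # L2-normalize so dot products equal cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)  # exclude each image from its own neighbor set
    refined = []
    for i in range(len(noisy_labels)):
        nbrs = np.argsort(sims[i])[-k:]                 # indices of the k nearest neighbors
        votes = Counter(noisy_labels[j] for j in nbrs)  # neighbors vote with their noisy labels
        votes[noisy_labels[i]] += 1                     # the image's own label also votes
        refined.append(votes.most_common(1)[0][0])
    return refined
```

On a toy example with two well-separated clusters, a single mislabeled point is corrected by its same-cluster neighbors, which is the intuition behind using neighborhood consensus to denoise open-ended MLLM labels before fine-tuning.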
Contribution/Results: Evaluated on a newly established VF-FGVR benchmark, NeaR achieves substantial accuracy gains over prior methods. It accelerates inference by over 100× compared to direct MLLM invocation and reduces API costs by more than 90%. By eliminating reliance on fixed taxonomies and expensive oracle annotations, NeaR establishes a scalable, low-cost paradigm for fine-grained recognition—particularly beneficial for resource-constrained domains such as medical imaging.
📝 Abstract
Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce **Nea**rest-Neighbor Label **R**efinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.