🤖 AI Summary
Vocabulary-free fine-grained image recognition aims to distinguish highly similar subcategories without relying on a predefined, human-curated vocabulary. Existing approaches are constrained by fixed lexical sets or by brittle multi-stage pipelines in which errors propagate. This paper introduces FiNDR, the first reasoning-driven, vocabulary-free framework for fine-grained recognition. FiNDR leverages reasoning-enhanced large multimodal models (LMMs) to autonomously generate candidate category names, filter and rank them, and construct a lightweight classifier, eliminating dependence on ground-truth label sets entirely. By challenging the conventional assumption that human-annotated vocabularies bound performance, FiNDR enables open-source LMMs to match or even surpass closed-source counterparts. It achieves state-of-the-art results on major benchmarks, with up to 18.8% relative improvement, and notably outperforms zero-shot baselines that use ground-truth class names. The code is publicly available.
📝 Abstract
Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions are limited either by reliance on a large, rigid vocabulary list or by dependence on complex pipelines with fragile heuristics, where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend vision-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on github.com/demidovd98/FiNDR.
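The three automated steps in the abstract can be sketched as a minimal pipeline. This is an illustrative toy implementation, not the paper's actual code: the LMM candidate generator and VLM scorer are stubbed with fixed values, and all function names are assumptions.

```python
# Hypothetical sketch of a FiNDR-style three-step pipeline.
# Model calls are stubbed; in practice step (i) prompts a reasoning-enabled
# LMM and step (ii) scores image-text pairs with a VLM such as CLIP.
from collections import Counter

def generate_candidate_names(image):
    """Step (i): an LMM proposes descriptive candidate labels (stubbed)."""
    return {"bird1.jpg": ["indigo bunting", "blue grosbeak", "indigo bunting"],
            "bird2.jpg": ["blue grosbeak", "blue grosbeak"]}[image]

def vlm_score(image, name):
    """Step (ii): a VLM scores image-name compatibility (stubbed)."""
    table = {("bird1.jpg", "indigo bunting"): 0.9, ("bird1.jpg", "blue grosbeak"): 0.4,
             ("bird2.jpg", "indigo bunting"): 0.3, ("bird2.jpg", "blue grosbeak"): 0.8}
    return table[(image, name)]

def build_class_set(images, top_k=2):
    """Step (ii) cont.: filter and rank candidates into a coherent class set."""
    votes = Counter()
    for img in images:
        for name in set(generate_candidate_names(img)):
            votes[name] += vlm_score(img, name)
    return [name for name, _ in votes.most_common(top_k)]

def classify(image, class_set):
    """Step (iii): a lightweight classifier over the verified names."""
    return max(class_set, key=lambda name: vlm_score(image, name))

images = ["bird1.jpg", "bird2.jpg"]
classes = build_class_set(images)
print(classify("bird1.jpg", classes))  # scores favor "indigo bunting"
```

The key design point is that the class set is discovered from the unlabeled images themselves, so inference never consults a human-curated vocabulary.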