🤖 AI Summary
Current multimodal large language models (MLLMs) show insufficient discriminative capability in fine-grained visual recognition (FGVR), owing to semantic misalignment between visual objects and category names. To address this, we propose an attribute-augmented contrastive alignment method. Our approach introduces a dual-path, attribute-level contrastive learning framework (object-attribute and attribute-category), coupled with hard negative mining, to align visual representations precisely with the fine-grained semantic space. Notably, the method requires no additional manual annotations: it leverages textual attribute descriptions to sharpen the model's sensitivity to subtle visual distinctions and rare subcategories. Extensive experiments on mainstream FGVR benchmarks, including CUB-200 and Stanford Cars, demonstrate consistent improvements over MLLMs of comparable scale. The source code is publicly available.
📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR: object information extraction, category knowledge reserve, and object-category alignment, and we position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously, using examples from similar but incorrect categories as hard negatives, which naturally brings representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
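The dual contrastive objective described above (object-attribute and attribute-category pairs, with similar-but-incorrect categories as hard negatives) can be sketched as two InfoNCE-style losses. This is a minimal illustrative sketch, not the authors' implementation: the function names (`info_nce`, `dual_path_loss`), the shapes, and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: pull each anchor toward its positive, push it away
    from per-anchor hard negatives.

    anchor:    (B, D) embeddings
    positive:  (B, D) matching embeddings
    negatives: (B, K, D) hard-negative embeddings per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    # The positive sits at index 0 of each row of logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

def dual_path_loss(obj_emb, attr_emb, cat_emb, hard_neg_attr, hard_neg_cat):
    """Sum the object-attribute and attribute-category contrastive terms,
    where hard negatives come from similar but incorrect categories."""
    loss_oa = info_nce(obj_emb, attr_emb, hard_neg_attr)   # object <-> attribute
    loss_ac = info_nce(attr_emb, cat_emb, hard_neg_cat)    # attribute <-> category
    return loss_oa + loss_ac
```

Training on both terms simultaneously encourages attributes to act as a semantic bridge, so that visual object embeddings and category-name embeddings end up closer in the shared space.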