Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models

📅 2024-02-26
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) exhibit significant limitations in fine-grained visual classification (FGVC): weak sensitivity to image-level detail and semantic misalignment between the visual and textual modalities, reflected, for example, in a 65.58-point average drop in exact match (EM) on Stanford Dogs for LLaVA-1.5. To address this, the authors propose Finer, an attribute-centric, multi-granularity evaluation and training paradigm for FGVC. Finer establishes a unified benchmark and introduces an instruction-tuning mixture that jointly incorporates fine-grained attribute annotations, cross-modal alignment objectives, and multi-scale visual prompting. Evaluated across six FGVC benchmark settings, Finer substantially improves the classification accuracy and attribute-generation quality of state-of-the-art LVLMs (e.g., LLaVA-1.5), while also enhancing zero-shot fine-grained discrimination and model interpretability, laying the groundwork for more precise, semantically grounded visual understanding in LVLMs.
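As a concrete illustration of the exact-match (EM) metric cited above, the sketch below shows one plausible way EM could be computed over free-form LVLM answers against gold fine-grained labels. The normalization rules and the `predictions`/`references` names are illustrative assumptions, not the paper's released evaluation code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Percentage of model answers that exactly match the gold fine-grained label."""
    hits = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Hypothetical example: free-form LVLM answers vs. Stanford Dogs breed labels.
preds = ["It looks like a Siberian Husky.", "Border Collie"]
golds = ["Siberian Husky", "Border Collie"]
print(exact_match(preds, golds))  # 50.0 -- the first verbose answer is not an exact match
```

The toy example also hints at why EM drops so sharply for chatty LVLM outputs: any extra wording around the correct concept name counts as a miss under strict matching.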

📝 Abstract
Recent advances in instruction-tuned Large Vision-Language Models (LVLMs) have imbued the models with the ability to generate high-level, image-grounded explanations with ease. While such capability is largely attributed to the rich world knowledge contained within the Large Language Models (LLMs), our work reveals their shortcomings in fine-grained visual categorization (FGVC) across six different benchmark settings. Most recent state-of-the-art LVLMs such as LLaVA-1.5, InstructBLIP and GPT-4V not only severely deteriorate in terms of classification performance, e.g., average drop of 65.58 in EM for Stanford Dogs for LLaVA-1.5, but also struggle to generate descriptive visual attributes based on a concept that appears within an input image despite their prominent zero-shot image captioning ability. In-depth analyses show that instruction-tuned LVLMs suffer from modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept. In an effort to further the community’s endeavor in this direction, we propose a multiple granularity attribute-centric benchmark and training mixture, Finer, which aims to establish a ground to evaluate LVLMs’ fine-grained visual comprehension ability and provide significantly improved explainability.
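The modality gap described in the abstract can be probed by asking the same model for a concept's attributes twice, once from the concept name alone and once from an image of the concept, and comparing the two answers. The sketch below shows one way such a probe could look; `lvlm_generate` is a hypothetical stand-in for any LVLM inference call (e.g., LLaVA-1.5), `husky.jpg` is an illustrative path, and the Jaccard word overlap is just one possible similarity measure, not the paper's analysis protocol.

```python
import re


def lvlm_generate(prompt: str, image_path: str | None = None) -> str:
    """Hypothetical stand-in for an LVLM inference call (e.g., LLaVA-1.5).

    Returns canned answers so the sketch runs end to end; swap in a real
    model's generate() call in practice.
    """
    if image_path is None:
        return "thick double coat, erect triangular ears, blue or brown eyes"
    return "fluffy coat, pointed ears, curled tail"


def attribute_overlap(text_answer: str, image_answer: str) -> float:
    """Jaccard overlap between attribute words produced from text vs. image input."""
    a = set(re.findall(r"[a-z]+", text_answer.lower()))
    b = set(re.findall(r"[a-z]+", image_answer.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0


concept = "Siberian Husky"
prompt = "List the visual attributes of this dog breed."

# Same concept, two modalities: the breed name alone vs. a picture of the breed.
from_text = lvlm_generate(f"{prompt} The breed is {concept}.")
from_image = lvlm_generate(prompt, image_path="husky.jpg")

# A low overlap suggests the concept is grounded differently across modalities.
print(round(attribute_overlap(from_text, from_image), 3))
```

A large gap between the two answer sets for the same concept is the kind of discrepancy the abstract attributes to instruction-tuned LVLMs.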
Problem

Research questions and friction points this paper is trying to address.

Large-scale Visual Language Models
Detail Recognition
Caption Utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finer
Visual Language Models Evaluation
Multimodal Integration Bias