🤖 AI Summary
Current large vision-language models (LVLMs) lack systematic evaluation on fine-grained image understanding tasks.
Method: We introduce FG-BMK—the first large-scale fine-grained benchmark comprising 3.49M questions and 3.32M images—and propose a multidimensional evaluation framework grounded in dual human- and model-centric perspectives, covering semantic recognition, feature representation, perturbation robustness, and hierarchical reasoning. We conduct cross-model experiments across eight state-of-the-art LVLMs/VLMs, integrating automated metrics, human evaluation, controlled image perturbations, and category-level hierarchy analysis.
Contribution/Results: Quantitative results reveal significant bottlenecks in existing LVLMs’ fine-grained recognition capabilities, which are primarily constrained by suboptimal vision-language alignment quality and training paradigms. This work fills a critical gap in fine-grained visual evaluation, providing a reproducible diagnostic toolkit and actionable pathways for model improvement.
📝 Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, FG-BMK, comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on eight representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.