Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) lack systematic evaluation on fine-grained image understanding tasks. Method: We introduce FG-BMK—the first large-scale fine-grained benchmark comprising 3.49M questions and 3.32M images—and propose a multidimensional evaluation framework grounded in dual human- and model-centric perspectives, covering semantic recognition, feature representation, perturbation robustness, and hierarchical reasoning. We conduct cross-model experiments across eight state-of-the-art LVLMs/VLMs, integrating automated metrics, human evaluation, controlled image perturbations, and category-level hierarchy analysis. Contribution/Results: Quantitative results reveal significant bottlenecks in existing LVLMs’ fine-grained recognition capabilities, primarily constrained by suboptimal vision-language alignment quality and training paradigms. This work fills a critical gap in fine-grained visual evaluation, providing a reproducible diagnostic toolkit and actionable pathways for model improvement.

📝 Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on eight representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
Problem

Research questions and friction points this paper is trying to address.

Fine-grained image tasks remain largely unexplored in LVLM evaluation
Assessing both semantic recognition and fine-grained feature representation in LVLMs
Understanding how training paradigms and modality alignment affect fine-grained task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FG-BMK, a large-scale benchmark (3.49M questions, 3.32M images) for fine-grained evaluation
Evaluates LVLMs from both human-oriented and machine-oriented perspectives
Analyzes the impact of training paradigms, modality alignment, perturbations, and category hierarchy on performance
Hong-Tao Yu
School of Computer Science and Engineering, Southeast University, China
Xiu-Shen Wei
Professor, Southeast University
Computer Vision, Machine Learning, Artificial Intelligence
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University, China
Serge Belongie
University of Copenhagen
Computer Vision, Machine Learning