🤖 AI Summary
Current large vision-language models (LVLMs) lack systematic evaluation on fine-grained image understanding tasks.
Method: We introduce FG-BMK—the first large-scale fine-grained benchmark comprising 3.49M questions and 3.32M images—and propose a multidimensional evaluation framework grounded in dual human- and model-centric perspectives, covering semantic recognition, feature representation, perturbation robustness, and hierarchical reasoning. We conduct cross-model experiments across eight state-of-the-art LVLMs/VLMs, integrating automated metrics, human evaluation, controlled image perturbations, and category-level hierarchy analysis.
Contribution/Results: Quantitative results reveal significant bottlenecks in existing LVLMs’ fine-grained recognition capabilities, which are primarily constrained by suboptimal vision-language alignment quality and training paradigms. This work fills a critical gap in fine-grained visual evaluation, providing a reproducible diagnostic toolkit and actionable pathways for model improvement.
📝 Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, FG-BMK, comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on eight representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.