🤖 AI Summary
Current large vision-language models (LVLMs) exhibit limited performance on fine-grained recognition tasks, and existing benchmarks predominantly emphasize reasoning capabilities while lacking open-world, fine-grained evaluation protocols.
Method: We introduce FROW, the first open-world benchmark for fine-grained visual recognition, and leverage GPT-4o for automated, high-quality data synthesis and evaluation. Our approach integrates mosaic responses (multi-granularity answer fusion) and open-world question sampling to jointly optimize data construction and training. We further propose a fine-grained-aware pretraining paradigm.
Contribution/Results: Experiments demonstrate that mosaic data improves category recognition accuracy by 1%; open-world data boosts FROW accuracy by 10–20% and content accuracy by 6–12%; and fine-grained pretraining achieves up to a 10% gain in category recognition accuracy. This work establishes a new benchmark, methodology, and empirical foundation for evaluating and enhancing LVLMs' fine-grained perception capabilities.
📝 Abstract
Large Vision-Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. Building on this benchmark, we propose a novel optimization strategy from two perspectives, *data construction* and *training process*, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1%, while open-world data boosts FROW benchmark accuracy by 10–20% and content accuracy by 6–12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
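To make the mosaic-data idea concrete, here is a minimal sketch of fusing several short single-granularity answers about one image into a single multi-part response. All function names, granularity labels, and the output format are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical sketch of "mosaic" answer fusion: merge several short
# answers at different granularities into one combined response.
# The record layout and label names are assumptions for illustration.

def build_mosaic_answer(short_answers):
    """Merge (granularity, answer) pairs into one multi-part response."""
    # Order from coarse to fine so the fused answer reads naturally.
    order = {"category": 0, "subcategory": 1, "attribute": 2}
    ranked = sorted(short_answers, key=lambda qa: order.get(qa[0], 99))
    return "; ".join(f"{granularity}: {answer}" for granularity, answer in ranked)

# Example: three short answers about one bird image.
answers = [
    ("attribute", "red crown"),
    ("category", "bird"),
    ("subcategory", "woodpecker"),
]
print(build_mosaic_answer(answers))
# -> category: bird; subcategory: woodpecker; attribute: red crown
```

The fused string packs coarse and fine labels into one training target, which is one plausible way to expose a model to multiple recognition granularities per image.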