🤖 AI Summary
Current large vision-language models (LVLMs) exhibit limited performance on fine-grained recognition tasks, and existing benchmarks predominantly emphasize reasoning capabilities while lacking open-world, fine-grained evaluation protocols.
Method: We introduce FROW, the first open-world benchmark for fine-grained visual recognition, and leverage GPT-4o for automated, high-quality data synthesis and evaluation. Our approach integrates mosaic responses (multi-granularity answer fusion) and open-world question sampling to jointly optimize data construction and training. We further propose a fine-grained-aware pretraining paradigm.
Contribution/Results: Experiments demonstrate that mosaic data improves category recognition accuracy by 1%; open-world data boosts FROW accuracy by 10–20% and content accuracy by 6–12%; and fine-grained pretraining achieves up to a 10% gain in category recognition accuracy. This work establishes a new benchmark, methodology, and empirical foundation for evaluating and enhancing LVLMs' fine-grained perception capabilities.
📝 Abstract
Large Vision-Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. Building on this benchmark, we propose a novel optimization strategy from two perspectives, *data construction* and *training process*, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1%, while open-world data boosts FROW benchmark accuracy by 10–20% and content accuracy by 6–12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
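To make the mosaic-data idea concrete, here is a minimal sketch of fusing several short single-granularity answers about one image into a single multi-part response. All function names, granularity labels, and the output format are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical sketch of "mosaic" answer fusion: merge several short
# answers at different granularities into one combined response.
# The record layout and label names are assumptions for illustration.

def build_mosaic_answer(short_answers):
    """Merge (granularity, answer) pairs into one multi-part response."""
    # Order from coarse to fine so the fused answer reads naturally.
    order = {"category": 0, "subcategory": 1, "attribute": 2}
    ranked = sorted(short_answers, key=lambda qa: order.get(qa[0], 99))
    return "; ".join(f"{granularity}: {answer}" for granularity, answer in ranked)

# Example: three short answers about one bird image.
answers = [
    ("attribute", "red crown"),
    ("category", "bird"),
    ("subcategory", "woodpecker"),
]
print(build_mosaic_answer(answers))
# -> category: bird; subcategory: woodpecker; attribute: red crown
```

The fused string packs coarse and fine labels into one training target, which is one plausible way to expose a model to multiple recognition granularities per image.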