🤖 AI Summary
To address the challenge of distinguishing visually similar categories in zero-shot fine-grained image classification, this paper reformulates classification as a visual question answering (VQA) task, leveraging the strong reasoning capabilities of large vision-language models (LVLMs) for label-free discrimination. The method introduces two key innovations: (1) a lightweight attention intervention mechanism that explicitly steers the LVLM's focus toward discriminative image regions; and (2) a high-quality, semantically rich benchmark of fine-grained category descriptions that improves prompt quality and generalization. Evaluated under strict zero-shot settings on standard benchmarks, including CUB-200-2011, Stanford Cars, and FGVC-Aircraft, the approach consistently outperforms existing state-of-the-art methods. The results demonstrate both the effectiveness and robustness of coupling the VQA paradigm with attention guidance for fine-grained recognition, establishing a new direction for annotation-free discriminative learning.
📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task that requires precise differentiation between visually similar categories, remains underexplored. We present a novel method that recasts zero-shot fine-grained image classification as a visual question-answering (VQA) task, leveraging LVLMs' comprehensive understanding capabilities rather than relying on direct class name generation. We further enhance model performance through an attention intervention technique, and we address a key limitation of existing datasets by developing more comprehensive and precise class description benchmarks. Extensive experiments across multiple fine-grained image classification benchmarks show that our method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both its effectiveness and the broader potential of LVLMs for zero-shot fine-grained classification tasks. Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification
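To make the classification-as-VQA idea concrete, the sketch below shows one generic way such a reformulation can be wired up: candidate class descriptions are rendered as a multiple-choice visual question, an LVLM answers with an option letter, and the letter is mapped back to a class name. This is a hypothetical illustration, not the paper's implementation; `build_vqa_prompt`, `classify`, and the `query_lvlm` callable (a stand-in for any LVLM inference backend) are all assumed names.

```python
# Hypothetical sketch of zero-shot classification posed as multiple-choice VQA.
# `query_lvlm(image, prompt)` is a placeholder for any LVLM inference call;
# it is an assumption, not the paper's actual interface.

def build_vqa_prompt(class_descriptions):
    """Render (class_name, description) pairs as a multiple-choice question."""
    options = "\n".join(
        f"({chr(ord('A') + i)}) {name}: {desc}"
        for i, (name, desc) in enumerate(class_descriptions)
    )
    return (
        "Which option best describes the object in the image?\n"
        f"{options}\n"
        "Answer with a single letter."
    )

def classify(image, class_descriptions, query_lvlm):
    """Ask the LVLM the multiple-choice question and map its answer
    (e.g. 'B') back to the corresponding class name."""
    prompt = build_vqa_prompt(class_descriptions)
    answer = query_lvlm(image, prompt).strip()
    return class_descriptions[ord(answer[0]) - ord("A")][0]
```

In this framing the LVLM only has to pick among rich, discriminative descriptions rather than generate an exact class name, which is what makes high-quality class description benchmarks (the paper's second contribution) directly useful as prompt material.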