🤖 AI Summary
This work addresses a limitation of existing vision-language models in taxonomic classification: a model may identify the leaf node correctly yet misclassify its coarser parent categories, because hierarchical reasoning is not modeled. To resolve this, the authors propose VL-Taxon, a two-stage, hierarchy-based reasoning framework. The first stage uses a top-down process to improve leaf-node classification accuracy; the second stage uses that leaf-level output to enforce consistency across the entire taxonomic hierarchy. Each stage is trained with supervised fine-tuning followed by reinforcement learning. Implemented on Qwen2.5-VL-7B and fine-tuned on only a small subset of data, VL-Taxon outperforms the original 72B model by over 10% on average in both leaf-node accuracy and hierarchical consistency on iNaturalist-2021, without relying on examples generated by other VLMs.
📝 Abstract
While Vision-Language Models (VLMs) excel at visual understanding, they often fail to grasp hierarchical knowledge. This leads to common errors where VLMs misclassify coarser taxonomic levels even when correctly identifying the most specific level (leaf level). Existing approaches largely overlook this issue by failing to model hierarchical reasoning. To address this gap, we propose VL-Taxon, a two-stage, hierarchy-based reasoning framework designed to improve both leaf-level accuracy and hierarchical consistency in taxonomic classification. The first stage employs a top-down process to enhance leaf-level classification accuracy. The second stage then leverages this accurate leaf-level output to ensure consistency throughout the entire taxonomic hierarchy. Each stage is initially trained with supervised fine-tuning to instill taxonomy knowledge, followed by reinforcement learning to refine the model's reasoning and generalization capabilities. Extensive experiments reveal a remarkable result: our VL-Taxon framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy on average on the iNaturalist-2021 dataset. Notably, this significant gain was achieved by fine-tuning on just a small subset of data, without relying on any examples generated by other VLMs.
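The two quantities the abstract reports, leaf-level accuracy and hierarchical consistency accuracy, can be illustrated with a minimal Python sketch. The abstract does not define either metric precisely, so the definitions here are assumptions: leaf accuracy counts a prediction as correct when its most specific label matches, and hierarchical consistency counts it as correct only when every level of the predicted path matches the ground truth. The function names and the toy taxonomy paths are hypothetical and are not taken from the paper or from iNaturalist-2021.

```python
from typing import Sequence

# A taxonomy path lists labels from the coarsest level down to the leaf,
# e.g. (order, family, genus, species). Paths are assumed to be non-empty.
Path = Sequence[str]


def leaf_accuracy(preds: Sequence[Path], golds: Sequence[Path]) -> float:
    """Fraction of samples whose leaf (last) label is correct."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    hits = sum(p[-1] == g[-1] for p, g in zip(preds, golds))
    return hits / len(golds)


def hierarchical_consistency(preds: Sequence[Path], golds: Sequence[Path]) -> float:
    """Fraction of samples whose whole path is correct (assumed definition)."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    hits = sum(tuple(p) == tuple(g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Toy data (hypothetical): four samples with three-level paths.
golds = [
    ("Passeriformes", "Corvidae", "Corvus corax"),
    ("Passeriformes", "Paridae", "Parus major"),
    ("Carnivora", "Felidae", "Felis catus"),
    ("Carnivora", "Canidae", "Vulpes vulpes"),
]
preds = [
    ("Passeriformes", "Corvidae", "Corvus corax"),  # fully correct
    ("Passeriformes", "Corvidae", "Parus major"),   # leaf right, family wrong
    ("Carnivora", "Canidae", "Canis lupus"),        # family and leaf wrong
    ("Carnivora", "Canidae", "Vulpes vulpes"),      # fully correct
]

leaf_acc = leaf_accuracy(preds, golds)
hca = hierarchical_consistency(preds, golds)
print(f"leaf accuracy: {leaf_acc:.2f}, hierarchical consistency: {hca:.2f}")
```

The second toy prediction shows the failure mode the abstract describes: the leaf label is right but a coarser level is wrong, so it counts toward leaf accuracy and not toward hierarchical consistency.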