🤖 AI Summary
This work addresses the limitations of existing training-free fine-grained visual recognition methods, which suffer from suboptimal accuracy and efficiency due to their neglect of inter-sample difficulty variations and inability to reuse past failure experiences. To overcome these issues, we propose a sample-adaptive inference framework that integrates rapid candidate retrieval with fine-grained reasoning, activating costly inference only when necessary. Our approach introduces a training-free introspective mechanism that dynamically leverages historical failure cases to provide discriminative guidance. Built upon large vision-language models, the method employs a cascaded retrieval–reasoning architecture combined with a reflexive experience-guided strategy, requiring no parameter updates throughout. Extensive experiments demonstrate state-of-the-art performance across 14 benchmark datasets while significantly reducing computational overhead.
📝 Abstract
Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) they apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, leading to suboptimal accuracy and efficiency; (2) they lack mechanisms to consolidate and reuse error-specific experience, causing repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets demonstrate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
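The cascaded, sample-adaptive flow described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the confidence margin, `retrieve_candidates`, `ExperienceBank`, and the `reasoner` callback are all hypothetical stand-ins for the retrieval stage, the failure-experience store, and the LVLM reasoning stage.

```python
CONFIDENCE_MARGIN = 0.15  # hypothetical gate: skip reasoning when retrieval is confident

def retrieve_candidates(sample):
    """Toy retrieval stage: rank candidate labels by score, highest first.
    Stands in for fast embedding/LVLM-based candidate retrieval."""
    return sorted(sample["scores"].items(), key=lambda kv: -kv[1])

class ExperienceBank:
    """Stores discriminative guidance distilled from past failure cases,
    keyed by the pair of labels that were confused."""
    def __init__(self):
        self.notes = {}  # frozenset({label_a, label_b}) -> guidance string

    def record_failure(self, predicted, actual, guidance):
        self.notes[frozenset((predicted, actual))] = guidance

    def lookup(self, labels):
        """Collect guidance for every confusable pair among the candidates."""
        hints = []
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                note = self.notes.get(frozenset((a, b)))
                if note:
                    hints.append(note)
        return hints

def classify(sample, bank, reasoner):
    ranked = retrieve_candidates(sample)
    (top, s1), (_, s2) = ranked[0], ranked[1]
    if s1 - s2 >= CONFIDENCE_MARGIN:
        return top  # easy sample: retrieval alone suffices, no costly reasoning
    # Hard sample: invoke fine-grained reasoning over the top candidates,
    # conditioned on experience hints from past failures (no parameter updates).
    candidates = [label for label, _ in ranked[:3]]
    return reasoner(sample, candidates, bank.lookup(candidates))
```

The key design point the sketch mirrors is that the expensive reasoning stage runs only for ambiguous samples, and its prompt is enriched with guidance recovered from previously recorded failures rather than learned weights.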