🤖 AI Summary
This work addresses ultra-fine-grained visual classification with limited training samples, where existing methods struggle to model holistic discriminative cues among highly similar categories. To overcome this, the authors propose the Divide-and-Conquer Holistic Cognition Network (DHCNet), which decomposes holistic cues into subtle local differences linked by their spatial topology. DHCNet combines a local-region self-shuffling strategy with an online holistic cue refinement mechanism, so that modeling progresses from local to global representations. The approach requires no additional annotations and reduces reliance on large training datasets. Experiments on five widely used ultra-fine-grained benchmarks show that DHCNet achieves state-of-the-art performance, supporting its effectiveness when inter-class similarity is high and samples are scarce.
📝 Abstract
Ultra-fine-grained visual categorization (Ultra-FGVC) aims to classify highly similar subcategories of fine-grained objects using limited training samples. However, holistic yet discriminative cues, such as leaf contours in extremely similar cultivars, remain under-explored in current studies, which limits recognition performance. Although these cues are crucial, modeling holistic cues with complex morphological structures typically requires massive training samples, which poses significant challenges in data-limited scenarios. To address this challenge, we propose a novel Divide-and-Conquer Holistic Cognition Network (DHCNet) that decomposes holistic cues into spatially associated subtle discrepancies and progressively builds up holistic cognition from them, simplifying holistic cognition while reducing dependency on training data. Technically, DHCNet first analyzes subtle discrepancies progressively, moving from smaller local patches to larger ones by applying a self-shuffling operation to local regions. At the same time, it uses the unaffected local regions to guide the perception of the original topological structure among the shuffled patches, which helps establish spatial associations for these discrepancies. In addition, DHCNet refines the holistic cues discovered from local regions online during training, iteratively improving their quality. DHCNet then uses these holistic cues as supervisory signals to fine-tune the parameters of the recognition model, improving its sensitivity to holistic cues across entire objects. Extensive evaluations demonstrate that DHCNet achieves remarkable performance on five widely used Ultra-FGVC datasets.
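The abstract does not spell out the self-shuffling operation, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a square local region of a `(H, W, C)` image is cut into equal tiles, the tiles are randomly permuted, and everything outside the region is left untouched. The function name `shuffle_local_region`, its parameters, and the small-to-large patch schedule `(2, 4, 8)` are all hypothetical.

```python
import numpy as np


def shuffle_local_region(image, top, left, size, patch, rng):
    """Hypothetical sketch: permute patch x patch tiles inside one square
    region of an (H, W, C) image and leave the rest of the image unchanged.

    Returns the shuffled image and the permutation that was applied
    (output tile k is input tile perm[k], tiles counted row by row).
    """
    if image.ndim != 3:
        raise ValueError("expected an (H, W, C) image")
    if size % patch != 0:
        raise ValueError("region size must be a multiple of the patch size")

    out = image.copy()
    n = size // patch  # tiles per side of the region
    region = image[top:top + size, left:left + size]

    # Split the region into a stack of n*n tiles, each (patch, patch, C).
    tiles = (region.reshape(n, patch, n, patch, -1)
                   .swapaxes(1, 2)
                   .reshape(n * n, patch, patch, -1))

    perm = rng.permutation(n * n)
    shuffled = tiles[perm]

    # Reassemble the permuted tiles into a (size, size, C) region.
    out[top:top + size, left:left + size] = (
        shuffled.reshape(n, n, patch, patch, -1)
                .swapaxes(1, 2)
                .reshape(size, size, -1))
    return out, perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.arange(16 * 16 * 3, dtype=np.float32).reshape(16, 16, 3)
    # Assumed schedule: small patches first, larger ones later.
    for patch in (2, 4, 8):
        shuffled, perm = shuffle_local_region(
            image, top=4, left=4, size=8, patch=patch, rng=rng)
        print(patch, perm.shape)
```

Returning the permutation alongside the image is a design choice for this sketch: a model trained to recover the original tile layout would need it as a target, while the untouched surroundings provide the context the abstract describes. How DHCNet actually selects regions and uses the shuffled views is not stated here.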