🤖 AI Summary
Few-shot fine-grained image classification (FS-FGIC) confronts two challenges at once: severe label scarcity and high inter-subclass visual similarity. Existing metric-based methods neglect spatial structural cues, while reconstruction-based approaches fail to exploit hierarchical features and lack mechanisms for focusing on discriminative regions. To address these limitations, the authors propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN). Its core contributions are: (1) a dual-layer feature reconstruction and fusion module that jointly models mid-level structural patterns and high-level semantics, balanced through learnable fusion weights; and (2) a spatial binary mask-enhanced transformer self-reconstruction module whose adaptive thresholding highlights discriminative regions while suppressing background interference. HMDRN achieves state-of-the-art performance on CUB, FGVC-Aircraft, and Stanford-Cars across Conv-4 and ResNet-12 backbones. Ablation studies confirm that dual-layer reconstruction enhances inter-class separability and mask-enhanced transformation reduces intra-class variation.
📝 Abstract
Few-shot fine-grained image classification (FS-FGIC) presents a significant challenge, requiring models to distinguish visually similar subclasses with limited labeled examples. Existing methods have critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods fail to utilize hierarchical feature information and lack mechanisms to focus on discriminative regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which integrates dual-layer feature reconstruction with mask-enhanced feature processing to improve fine-grained classification. HMDRN incorporates a dual-layer feature reconstruction and fusion module that leverages complementary visual information from different network hierarchies. Through learnable fusion weights, the model balances high-level semantic representations from the last layer with mid-level structural details from the penultimate layer. Additionally, we design a spatial binary mask-enhanced transformer self-reconstruction module that processes query features through adaptive thresholding while maintaining complete support features, enhancing focus on discriminative regions while filtering background noise. Extensive experiments on three challenging fine-grained datasets demonstrate that HMDRN consistently outperforms state-of-the-art methods across Conv-4 and ResNet-12 backbone architectures. Comprehensive ablation studies validate the effectiveness of each proposed component, revealing that dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations. Visualization results provide evidence of HMDRN's superior feature reconstruction capabilities.
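The abstract describes two mechanisms concretely enough to sketch: fusing classification scores from the last and penultimate layers via learnable weights, and filtering query feature positions with an adaptively thresholded binary spatial mask. The sketch below illustrates both in plain NumPy; the function names, the softmax parameterization of the fusion weights, and the L2-norm "energy" used for masking are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax for the learnable fusion logits
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_similarities(sim_high, sim_mid, fusion_logits):
    # Learnable weighted fusion (assumed softmax parameterization):
    # balances the last-layer (semantic) and penultimate-layer
    # (structural) reconstruction similarity scores.
    w = softmax(np.asarray(fusion_logits, dtype=float))
    return w[0] * sim_high + w[1] * sim_mid

def binary_spatial_mask(query_feat, tau):
    # query_feat: (H*W, C) spatial feature map flattened over positions.
    # Keep positions whose normalized activation energy exceeds the
    # (learnable, here fixed) threshold tau; zero out the rest.
    energy = np.linalg.norm(query_feat, axis=1)
    energy = (energy - energy.min()) / (energy.max() - energy.min() + 1e-8)
    return (energy > tau).astype(query_feat.dtype)

# Toy usage: equal fusion logits give a 0.5/0.5 weighting,
# and the mask marks high-energy spatial positions as 1.
fused = fuse_similarities(0.8, 0.4, [0.0, 0.0])   # -> 0.6
mask = binary_spatial_mask(np.random.rand(16, 64), tau=0.5)
masked_query = np.random.rand(16, 64) * mask[:, None]  # background positions zeroed
```

In the paper's design only the query features are masked while support features stay complete; the mask above would therefore be applied on the query side before the transformer self-reconstruction step.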