🤖 AI Summary
This work addresses the long-overlooked trade-off between accuracy and computational cost in fine-grained image recognition. Through over 2,000 experiments, we systematically evaluate six training and evaluation protocols, nine pretrained backbones, and seventeen datasets, and find that high accuracy can be achieved by applying data-aware augmentations during training alone, without the computationally expensive extra forward pass on discriminative crops at inference. We extend the Counterfactual Attention Learning (CAL) augmentation method with cross-image discriminative region mixing and propose an efficient evaluation variant that skips the crop forward pass, substantially reducing inference overhead. Experimental results show that our approach maintains competitive accuracy while significantly lowering computational cost. Code and models are publicly released.
📝 Abstract
Prior work on fine-grained image recognition (FGIR) has established the importance of backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2,000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with a cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forgoing the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations applied only during training can enable a model to achieve excellent accuracy even without crops at inference, significantly reducing inference costs. To support future research we share our code and checkpoints at: https://github.com/arkel23/FGIR-Backbones
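To make the idea of cross-image discriminative region mixing more concrete, the following is a minimal NumPy sketch under stated assumptions; it is not the authors' implementation. It assumes a CutMix-style paste: the high-attention bounding box of one image is copied into another image at the same location, and the one-hot labels are mixed in proportion to the pasted area. The function names (`attention_bbox`, `mix_regions`), the `thresh` parameter, and the area-based label weighting are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_bbox(attn, thresh=0.5):
    """Return (y0, y1, x0, x1) bounding the pixels with attn >= thresh * max.

    Assumes `attn` is a non-negative 2D attention map (hypothetical input).
    """
    mask = attn >= thresh * attn.max()
    ys, xs = np.where(mask)
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1


def mix_regions(img_a, img_b, attn_b, label_a, label_b, thresh=0.5):
    """Paste img_b's high-attention box into img_a and mix labels by area.

    Illustrative assumption: a CutMix-style paste at the same coordinates,
    with the label weight `lam` equal to the pasted fraction of the image.
    """
    y0, y1, x0, x1 = attention_bbox(attn_b, thresh)
    mixed = img_a.copy()
    mixed[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    lam = (y1 - y0) * (x1 - x0) / (attn_b.shape[0] * attn_b.shape[1])
    label = (1.0 - lam) * label_a + lam * label_b
    return mixed, label, lam


# Toy usage: a 4x4 RGB image pair where image B attends to its 2x2 centre.
img_a = np.zeros((4, 4, 3))
img_b = np.ones((4, 4, 3))
attn_b = np.zeros((4, 4))
attn_b[1:3, 1:3] = 1.0
mixed, label, lam = mix_regions(
    img_a, img_b, attn_b, np.array([1.0, 0.0]), np.array([0.0, 1.0])
)
```

In this sketch the mixing happens only at training time, so the model used at inference sees a single full image and needs no extra forward pass on crops, which is the cost saving the abstract describes.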