Fine-Grained Image Recognition from Scratch with Teacher-Guided Data Augmentation

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained image recognition (FGIR) has long relied on large-scale pre-trained models, hindering deployment in resource-constrained settings and impeding task-specific architecture design. To address this, we propose TGDA—the first fully from-scratch FGIR framework—eliminating pre-training dependencies via fine-grained-aware teacher-guided data augmentation, knowledge distillation, and weakly supervised learning. We further introduce two lightweight, task-specialized architectures: LRNets, optimized for low-resolution inputs, and ViTFS, an efficient Vision Transformer variant tailored for FGIR. Experiments demonstrate that LRNets achieve up to a 23% accuracy gain while using up to 20.6× fewer parameters; ViTFS-T attains performance competitive with or superior to state-of-the-art pre-trained methods on the CUB and Stanford Cars benchmarks, despite requiring orders of magnitude less training data. Our work establishes a new paradigm for pre-training-free FGIR, advancing both efficiency and architectural specialization.

📝 Abstract
Fine-grained image recognition (FGIR) aims to distinguish visually similar sub-categories within a broader class, such as identifying bird species. While most existing FGIR methods rely on backbones pretrained on large-scale datasets like ImageNet, this dependence limits adaptability to resource-constrained environments and hinders the development of task-specific architectures tailored to the unique challenges of FGIR. In this work, we challenge the conventional reliance on pretrained models by demonstrating that high-performance FGIR systems can be trained entirely from scratch. We introduce a novel training framework, TGDA, that integrates data-aware augmentation with weak supervision via a fine-grained-aware teacher model, implemented through knowledge distillation. This framework unlocks the design of task-specific and hardware-aware architectures, including LRNets for low-resolution FGIR and ViTFS, a family of Vision Transformers optimized for efficient inference. Extensive experiments across three FGIR benchmarks, over diverse settings involving low-resolution and high-resolution inputs, show that our method consistently matches or surpasses state-of-the-art pretrained counterparts. In particular, in the low-resolution setting, LRNets trained with TGDA improve accuracy by up to 23% over prior methods while requiring up to 20.6x fewer parameters, lower FLOPs, and significantly less training data. Similarly, ViTFS-T can match the performance of a ViT B-16 pretrained on ImageNet-21k while using 15.3x fewer trainable parameters and requiring orders of magnitude less data. These results highlight TGDA's potential as an adaptable alternative to pretraining, paving the way for more efficient fine-grained vision systems.
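The abstract describes supervising a from-scratch student through knowledge distillation from a fine-grained-aware teacher. The paper's exact loss is not given here, so the sketch below shows the standard distillation objective (temperature-softened teacher targets combined with hard-label cross-entropy); the temperature `T` and mixing weight `alpha` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    T and alpha are assumed hyperparameters for illustration only.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Soft term: KL divergence from the teacher's softened targets,
    # scaled by T^2 so gradients stay comparable across temperatures.
    soft = np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=-1,
    )
    # Hard term: cross-entropy against the ground-truth labels.
    p_hard = softmax(student_logits, 1.0)
    hard = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * T * T * soft + (1.0 - alpha) * hard))
```

In a TGDA-style loop, the student would see teacher-guided augmented views of each image and minimize this loss against the teacher's predictions on the same views; with only hard labels (`alpha=0`) the objective reduces to ordinary from-scratch training.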
Problem

Research questions and friction points this paper is trying to address.

Enabling fine-grained image recognition without pretrained models
Reducing resource dependency for task-specific FGIR architectures
Improving accuracy and efficiency in low-resolution FGIR settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-guided data augmentation for FGIR
Task-specific architectures like LRNets
Vision Transformers optimized for efficiency