Saccadic Vision for Fine-Grained Visual Classification

πŸ“… 2025-09-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Fine-grained visual classification (FGVC) remains highly challenging due to large intra-class variation and subtle inter-class discriminative cues. Existing part-based methods rely on complex localization networks, resulting in poor feature transferability, high spatial redundancy, and difficulty in adaptively determining the optimal number of parts. Inspired by human eye movement mechanisms, the authors propose a two-stage feature extraction framework: (1) peripheral perception generates a sampling map from which features of attended regions are extracted in parallel; (2) context-aware selective attention fuses global and local information, while non-maximum suppression eliminates spatial redundancy. The approach employs weight-shared encoders and fixed-region parallel encoding, balancing computational efficiency, discriminability, and interpretability. Extensive experiments on CUB-200-2011, NABirds, Food-101, Stanford-Dogs, and multiple insect datasets show performance comparable to the state of the art, while consistently outperforming the baseline encoder.

πŸ“ Abstract
Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features, a task that remains challenging due to high intra-class variability and limited inter-class differences. Existing part-based methods often rely on complex localization networks that learn mappings from pixel to sample space, requiring a deep understanding of image content while limiting feature utility for downstream tasks. In addition, sampled points frequently suffer from high spatial redundancy, making it difficult to quantify the optimal number of required parts. Inspired by human saccadic vision, we propose a two-stage process that first extracts peripheral features (coarse view) and generates a sample map, from which fixation patches are sampled and encoded in parallel using a weight-shared encoder. We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations. To prevent spatial collapse, a common issue in part-based methods, we utilize non-maximum suppression during fixation sampling to eliminate redundancy. Comprehensive evaluation on standard FGVC benchmarks (CUB-200-2011, NABirds, Food-101, and Stanford-Dogs) and challenging insect datasets (EU-Moths, Ecuador-Moths, and AMI-Moths) demonstrates that our method achieves comparable performance to state-of-the-art approaches while consistently outperforming our baseline encoder.
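The two-stage process described in the abstract can be sketched at a toy level as follows. This is a minimal illustrative sketch, not the paper's architecture: the mean-pooling "encoder", the 8×8 sampling-map resolution, the top-k fixation selection, and the dot-product attention scoring are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(patch):
    # Stand-in for the weight-shared encoder: mean-pool the patch
    # into a 3-dim feature vector (hypothetical toy encoder).
    return patch.mean(axis=(0, 1))

# Stage 1: peripheral perception over the coarse view produces a
# sampling map (here: one random score per 8x8 image cell).
image = rng.random((64, 64, 3))
saliency = rng.random((8, 8))

# Sample the k most salient cells as fixation patches.
k = 4
flat = np.argsort(saliency, axis=None)[::-1][:k]
rows, cols = np.unravel_index(flat, saliency.shape)

# Encode every fixation patch with the SAME encoder ("weight-shared",
# fixed-region parallel encoding).
patch = 8
fixation_feats = np.stack([
    encode(image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch])
    for r, c in zip(rows, cols)
])

# Stage 2: contextualized selective attention - weight each fixation
# feature by its affinity to the peripheral (global) feature, then
# fuse global and local representations.
peripheral_feat = encode(image)
scores = fixation_feats @ peripheral_feat
weights = np.exp(scores - scores.max())
weights /= weights.sum()
fused = peripheral_feat + (weights[:, None] * fixation_feats).sum(axis=0)
print(fused.shape)  # (3,)
```

The fused vector would then feed a classification head; in the actual method the encoder is a learned network and the attention weights are learned rather than computed by a raw dot product.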
Problem

Research questions and friction points this paper is trying to address.

Distinguishing visually similar fine-grained categories
Reducing spatial redundancy in part-based methods
Preventing spatial collapse during fixation sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage peripheral and focus feature extraction
Contextualized selective attention for patch weighting
Non-maximum suppression to eliminate spatial redundancy
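The non-maximum suppression step in the last bullet can be illustrated with a minimal greedy sketch over candidate fixation points: keep the highest-scoring candidate, discard all candidates closer than a distance threshold, and repeat. The point coordinates, scores, and `min_dist` threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nms_fixations(points, scores, min_dist):
    """Greedy NMS over candidate fixation points.

    points: (N, 2) array of (y, x) coordinates.
    scores: (N,) saliency score per candidate.
    Returns indices of kept fixations, highest score first.
    """
    order = np.argsort(scores)[::-1]  # candidates by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # Drop every remaining candidate within min_dist of the winner,
        # eliminating spatially redundant fixations.
        d = np.linalg.norm(points[order] - points[best], axis=1)
        order = order[d >= min_dist]
    return keep

# Two tight clusters plus one isolated point: NMS keeps one per cluster.
points = np.array([[10, 10], [12, 11], [40, 40], [41, 42], [70, 10]])
scores = np.array([0.9, 0.8, 0.7, 0.95, 0.5])
print(nms_fixations(points, scores, min_dist=5.0))  # [3, 0, 4]
```

A useful side effect of this formulation is that the number of surviving fixations adapts to the image: spread-out salient regions yield more parts, clustered ones fewer, which addresses the "optimal number of parts" friction point above.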
πŸ”Ž Similar Papers
No similar papers found.