🤖 AI Summary
Vision Transformers (ViTs) applied to ear recognition lose fine-grained detail because of their standard non-overlapping patch partitioning. Method: This work proposes an overlap-aware ViT optimization framework that systematically varies patch size and stride to enable overlapping image patches, demonstrating the critical role of overlap in modeling ear-specific fine-grained features; the best configuration uses 28×28 patches with a stride of 14 on standardized 112×112 ear images. Contribution/Results: Across four benchmarks (OPIB, AWE, WPUT, and EarVN1.0), a lightweight ViT-T variant outperforms ViT-S, ViT-B, and ViT-L on AWE, WPUT, and EarVN1.0. In 48 ablation and comparative experiments, the overlapping strategy achieves top performance in 44 cases; notably, recognition accuracy on EarVN1.0 improves by up to 10%. This work establishes overlapping patching as an effective paradigm for ViT-based ear recognition and offers a promising direction for lightweight biometric identification.
📝 Abstract
Ear recognition has emerged as a promising biometric modality due to the ear's relative stability in appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their effectiveness in ear recognition has been hampered by a lack of attention to overlapping patches, which are crucial for capturing intricate ear features. In this study, we evaluate the ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B), and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0) using an overlapping patch selection strategy. The results demonstrate the critical importance of overlapping patches, which yield superior performance in 44 of the 48 experiments in a structured study. Moreover, compared with the non-overlapping configurations, overlapping patches deliver significant gains, reaching up to 10% on the EarVN1.0 dataset. In terms of model performance, ViT-T consistently outperformed ViT-S, ViT-B, and ViT-L on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved with a patch size of 28×28 and a stride of 14 pixels: on the normalized 112×112-pixel images, the patch side thus spans 25% of the image width or height, and the stride 12.5% of the row or column size. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition in verification scenarios.
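The overlapping split described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation (in practice the overlap is typically realized by setting the stride of the ViT's patch-embedding convolution below the patch size); the function name and defaults are ours. With a 28×28 patch and a stride of 14 on a 112×112 image, it produces a 7×7 grid of 49 patches, each overlapping its neighbors by 50%.

```python
import numpy as np

def extract_patches(img: np.ndarray, patch: int = 28, stride: int = 14) -> np.ndarray:
    """Extract overlapping square patches from an HxW (or HxWxC) image.

    With patch=28 and stride=14 on a 112x112 image this yields a 7x7 grid
    of patches (49 tokens). Setting stride == patch recovers the standard
    non-overlapping ViT partitioning.
    """
    h, w = img.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    return np.stack([
        img[r * stride : r * stride + patch, c * stride : c * stride + patch]
        for r in range(rows)
        for c in range(cols)
    ])  # shape: (rows * cols, patch, patch[, C])

img = np.zeros((112, 112), dtype=np.float32)
print(extract_patches(img).shape)               # (49, 28, 28) -- overlapping
print(extract_patches(img, stride=28).shape)    # (16, 28, 28) -- non-overlapping
```

Note that the overlapping configuration roughly triples the token count (49 vs. 16 here), which increases attention cost; the strong results of the lightweight ViT-T suggest this trade-off pays off for fine-grained ear features.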