Improved Ear Verification with Vision Transformers and Overlapping Patches

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) applied to ear recognition lose fine-grained detail because of non-overlapping patch partitioning. Method: This work proposes an overlap-aware ViT optimization framework, systematically investigating patch size (28×28) and stride (14) to enable overlapping image patches and to probe their role in modeling ear-specific fine-grained features. Experiments are conducted on standardized 112×112 ear images. Contribution/Results: A lightweight ViT-T variant outperforms ViT-S, ViT-B, and ViT-L on the AWE, WPUT, and EarVN1.0 benchmarks. Across 48 ablation and comparative experiments, the overlapping strategy achieves top performance in 44 cases; notably, recognition accuracy on EarVN1.0 improves by up to 10%. This work establishes overlapping patching as an effective strategy for ViT-based ear recognition and offers a promising direction for lightweight biometric identification.

📝 Abstract
Ear recognition has emerged as a promising biometric modality due to the relative stability in appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their efficiency in ear recognition has been hampered by a lack of attention to overlapping patches, which is crucial for capturing intricate ear features. In this study, we evaluate ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B) and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0), using an overlapping patch selection strategy. Results demonstrate the critical importance of overlapping patches, yielding superior performance in 44 of 48 experiments in a structured study. Moreover, upon comparing the results of the overlapping patches with the non-overlapping configurations, the increase is significant, reaching up to 10% for the EarVN1.0 dataset. In terms of model performance, the ViT-T model consistently outperformed the ViT-S, ViT-B, and ViT-L models on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved in a configuration with a patch size of 28×28 and a stride of 14 pixels. This patch-stride configuration represents 25% of the normalized image area (112×112 pixels) for the patch size and 12.5% of the row or column size for the stride. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition tasks in verification scenarios.
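The geometry of the best-performing configuration can be sketched with a simple sliding-window patch extractor. This is an illustrative NumPy sketch, not the authors' implementation: it shows how a 28×28 patch with stride 14 yields overlapping crops of a 112×112 image, versus the 16 non-overlapping patches of a standard ViT tokenizer with stride equal to patch size.

```python
import numpy as np

def extract_patches(image, patch=28, stride=14):
    """Slide a patch x patch window over a 2-D image with the given stride,
    returning an (n_patches, patch, patch) array of (possibly overlapping) crops."""
    h, w = image.shape
    crops = [
        image[r:r + patch, c:c + patch]
        for r in range(0, h - patch + 1, stride)
        for c in range(0, w - patch + 1, stride)
    ]
    return np.stack(crops)

img = np.zeros((112, 112))  # stand-in for a normalized ear image

# Non-overlapping baseline: stride == patch size -> (112/28)^2 = 16 patches.
print(extract_patches(img, patch=28, stride=28).shape)  # (16, 28, 28)

# Overlapping configuration from the paper: stride 14 -> ((112-28)/14 + 1)^2 = 49 patches.
print(extract_patches(img, patch=28, stride=14).shape)  # (49, 28, 28)
```

Halving the stride roughly triples the token count (16 → 49), which is the trade-off behind the reported accuracy gains: each fine-grained ear structure is seen by several patches instead of being split across patch boundaries.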
Problem

Research questions and friction points this paper is trying to address.

Enhancing ear recognition using Vision Transformers with overlapping patches
Evaluating ViT configurations for improved biometric verification accuracy
Optimizing patch-stride settings to capture intricate ear features effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision Transformers for ear recognition
Implements overlapping patch selection strategy
Optimizes patch size and stride configuration
Deeksha Arun
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556
Kagan Ozturk
University of Notre Dame
Biometrics · Computer Vision · Deep Learning
Kevin W. Bowyer
Schubmehl-Prein Family Professor of Computer Science and Engineering, University of Notre Dame
Biometrics · Pattern Recognition · Computer Vision · Data Mining
Patrick Flynn
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556