🤖 AI Summary
This work addresses the performance degradation of Vision Transformers (ViTs) in ear recognition caused by non-ear content captured in rectangular image patches and by the mismatch between ear morphological variability and ViT positional sensitivity. To mitigate these issues, the authors propose an anatomy-aware patch alignment method that uses feature detection to guide local patch-based warping, aligning patch boundaries with the anatomical contours of the ear so that token representations conform to its natural curvature. By integrating anatomical knowledge into the ViT preprocessing pipeline (a novel contribution), the method improves robustness and recognition accuracy across varying ear shapes, sizes, and poses, as demonstrated on multiple ViT architectures (ViT-T/S/B/L).
📝 Abstract
The rectangular tokens common to vision transformer (ViT) methods for visual recognition can degrade performance by incorporating information from outside the objects to be recognized. This paper introduces PaW-ViT, a Patch-based Warping Vision Transformer: a preprocessing approach rooted in anatomical knowledge that normalizes ear images to improve ViT efficacy. By aligning token boundaries to detected ear feature boundaries, PaW-ViT gains robustness to variation in shape, size, and pose; by aligning those boundaries to the ear's natural curvature, it produces more consistent token representations across morphologies. Experiments on ViT-T, ViT-S, ViT-B, and ViT-L confirm the effectiveness of PaW-ViT and demonstrate robust alignment under variation in shape, size, and pose. Our work addresses the disconnect between ear biometric morphological variation and the positional sensitivity of transformer architectures, suggesting a possible avenue for authentication schemes.
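The abstract describes warping an image so that patch (token) boundaries follow a detected ear contour before standard ViT tokenization. The paper's actual warping and feature-detection procedure is not given here; the following is a minimal, hypothetical sketch of the general idea, assuming a grayscale image, a precomputed per-row contour x-coordinate (`contour_x`), and a simple per-row shift as the "warp". All function names and the target-column choice are illustrative, not the authors' implementation.

```python
import numpy as np

def warp_rows_to_contour(img, contour_x):
    """Shift each row so the detected ear contour lands on a fixed column,
    so that vertical patch boundaries line up with the contour.
    img: (H, W) grayscale array; contour_x: (H,) contour x-coordinate per row."""
    H, W = img.shape
    target = W // 4  # hypothetical column where the contour should land
    out = np.empty_like(img)
    for y in range(H):
        out[y] = np.roll(img[y], target - int(contour_x[y]))
    return out

def tokenize(img, p):
    """Split an (H, W) image into non-overlapping p x p patches (ViT tokens)."""
    H, W = img.shape
    return img.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

# Hypothetical demo: a synthetic 32x32 "ear" image with a slanted contour.
img = np.random.rand(32, 32).astype(np.float32)
contour_x = np.linspace(10, 20, 32).astype(int)
tokens = tokenize(warp_rows_to_contour(img, contour_x), p=8)  # 16 tokens, 64 px each
```

In this toy version the warp is a rigid per-row shift; the paper's local patch-based warping would instead deform each patch region so its boundary tracks the ear's curvature, but the pipeline shape (detect features, warp, then tokenize) is the same.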