Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of conventional face and iris recognition in unconstrained surveillance scenarios by proposing a periocular identification framework that operates on video sequences. A deep convolutional network extracts frame-level features, and an encoder-only Transformer adaptively aggregates them into a single video-level representation. The attention mechanism lets the aggregation weight frames adaptively instead of treating every frame equally. Evaluated on the COX Face dataset, the approach achieves, in its best scenario, a true positive rate of 99.8% at a false positive rate of 1e⁻¹ and a Rank-5 identification rate of 96.6%, consistently outperforming naive aggregation schemes.
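
The two figures reported above are a verification metric (true positive rate at a fixed false positive rate) and an identification metric (Rank-k rate). As a rough illustration of how such numbers are commonly computed, and not the paper's evaluation code, the sketch below assumes similarity scores where a higher value means a better match. The helper names `tpr_at_fpr` and `rank_k_rate` and all toy scores are hypothetical.

```python
def tpr_at_fpr(genuine, impostor, target_fpr):
    """Fraction of genuine pairs accepted at a threshold whose
    false positive rate does not exceed target_fpr.

    Scores are similarities: higher means "more likely the same identity".
    """
    # How many impostor pairs may be (wrongly) accepted.
    allowed = int(target_fpr * len(impostor))
    ranked = sorted(impostor, reverse=True)
    # Accept only scores strictly above this threshold, so at most
    # `allowed` impostor scores pass (ties make the test stricter).
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for s in genuine if s > threshold)
    return accepted / len(genuine)


def rank_k_rate(similarity, probe_ids, gallery_ids, k):
    """Fraction of probes whose true identity is among the k most
    similar gallery entries. similarity[i][j] compares probe i to gallery j.
    """
    hits = 0
    for row, true_id in zip(similarity, probe_ids):
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top_ids = [gallery_ids[j] for j in order[:k]]
        if true_id in top_ids:
            hits += 1
    return hits / len(probe_ids)


# Toy data (made up for illustration only).
genuine_scores = [0.9, 0.8, 0.3, 0.95]
impostor_scores = [0.1, 0.2, 0.25, 0.4, 0.15, 0.05, 0.3, 0.35, 0.12, 0.5]
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.5, 0.4],
]
toy_ids = ["a", "b", "c"]

print(tpr_at_fpr(genuine_scores, impostor_scores, 0.1))
print(rank_k_rate(toy_similarity, toy_ids, toy_ids, 2))
```

Under this reading, a higher TPR at the same false positive rate means fewer genuine matches are rejected, and Rank-5 counts a probe as correct whenever the true identity appears among the five best-scoring gallery entries.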

📝 Abstract

Video periocular recognition is the task of recognizing an individual's identity based on the region around their eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric modalities such as face or iris recognition become infeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that learns to adaptively aggregate frame-level features into a single video representation, and that also produces a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, which consistently outperforms naive aggregation schemes. In the best scenario, the approach achieves a TPR@$1e^{-1}$ of $99.8\%$ and a Rank-5 rate of $96.6\%$.
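
To make the contrast between naive and adaptive aggregation concrete, here is a minimal sketch under stated assumptions. It does not reproduce the paper's encoder-only transformer. It substitutes a much simpler single-query softmax attention pooling, compared against plain averaging of frame features. The function names, the `query` vector (which would be learned in practice), and the toy frame features are all hypothetical.

```python
import math


def mean_pool(frames):
    """Naive aggregation: average the frame-level feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def attention_pool(frames, query):
    """Simplified attention aggregation (a stand-in, not the paper's model).

    Each frame gets a scaled dot-product score against `query`; a softmax
    turns the scores into weights, and the output is the weighted sum.
    """
    d = len(query)
    scores = [
        sum(f_i * q_i for f_i, q_i in zip(frame, query)) / math.sqrt(d)
        for frame in frames
    ]
    # Subtract the maximum score for a numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * frame[j] for w, frame in zip(weights, frames))
        for j in range(d)
    ]
    return pooled, weights


# Toy frame features (made up): two 2-dimensional frame embeddings.
toy_frames = [[1.0, 0.0], [0.0, 1.0]]

uniform_pooled, uniform_weights = attention_pool(toy_frames, [0.0, 0.0])
focused_pooled, focused_weights = attention_pool(toy_frames, [10.0, 0.0])

print(mean_pool(toy_frames))
print(uniform_weights, focused_weights)
```

With an all-zero query every frame receives the same weight, so the result coincides with mean pooling. A query aligned with the first frame shifts nearly all weight onto it, which is the kind of frame-dependent weighting that fixed averaging cannot express.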

Problem

Research questions and friction points this paper is trying to address.

video periocular recognition

surveillance

biometric recognition

unconstrained conditions

identity verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention-aware

transformer-based aggregation

video periocular recognition