🤖 AI Summary
Domain discrepancies degrade biometric recognition when matching infrared imagery (SWIR, MWIR, LWIR) against visible-light (RGB) imagery. This paper shows that body embeddings, used in place of conventional face embeddings, perform better for cross-spectral person identification in the MWIR and LWIR domains. Methodologically, the authors use a Vision Transformer architecture trained with a combination of cross-entropy and triplet losses, and evaluate it on the IARPA IJB-MDF dataset. Key contributions include: (1) evidence that body embeddings outperform face embeddings for cross-spectral identification in MWIR and LWIR; (2) benchmark results on IJB-MDF for matching SWIR, MWIR, and LWIR images against RGB images; and (3) empirical insights into how the infrared domains relate to one another, how well models pretrained on visible-spectrum (VIS) data adapt, what role local semantic features play in body embeddings, and which training strategies work for small datasets. In addition, fine-tuning a body model pretrained exclusively on VIS data achieves state-of-the-art mAP on the LLCM dataset.
📝 Abstract
Biometric recognition becomes increasingly challenging as we move away from the visible spectrum to infrared imagery, where domain discrepancies significantly impact identification performance. In this paper, we show that body embeddings perform better than face embeddings for cross-spectral person identification in the medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. Due to the lack of multi-domain datasets, previous research on cross-spectral body identification, also known as Visible-Infrared Person Re-Identification (VI-ReID), has primarily focused on individual infrared bands, such as near-infrared (NIR) or LWIR, separately. We address the multi-domain body recognition problem using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which enables matching of short-wave infrared (SWIR), MWIR, and LWIR images against RGB (VIS) images. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset and, through extensive experiments, provide insights into the interrelation of infrared domains, the adaptability of VIS-pretrained models, the role of local semantic features in body embeddings, and effective training strategies for small datasets. Additionally, we show that fine-tuning a body model pretrained exclusively on VIS data with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores on the LLCM dataset.
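The abstract names the training objective only as "a simple combination of cross-entropy and triplet losses". The sketch below illustrates one common way such a combination can be written for a single training triplet; it is not the authors' implementation. The function names, the margin of 0.3, the Euclidean distance, and the unit weighting between the two terms are all assumptions made for illustration, since none of them is specified in the text above.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: pull the positive closer than the negative.

    The margin value is an assumption, not taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(logits, label, anchor, positive, negative, margin=0.3, weight=1.0):
    """Identity classification loss plus a weighted triplet term.

    `weight` is a hypothetical balancing factor; the abstract does not say
    how (or whether) the two terms are weighted.
    """
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )


# Example: an infrared anchor, an RGB positive of the same identity,
# and an RGB negative of a different identity (toy 2-D embeddings).
loss = combined_loss(
    logits=[2.0, 0.5, -1.0],
    label=0,
    anchor=[0.0, 0.0],
    positive=[0.1, 0.0],
    negative=[1.0, 1.0],
)
```

In a cross-spectral setting, the triplet term is what ties the domains together: choosing the anchor from an infrared band and the positive and negative from VIS images encourages embeddings of the same identity to lie close regardless of spectrum, while the cross-entropy term keeps identities separable.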