🤖 AI Summary
This study addresses the challenge of extreme far-distance video person re-identification, where the performance of conventional models deteriorates sharply due to scale compression, low resolution, motion blur, and the domain gap between aerial and ground viewpoints. To tackle this, the work presents the first effective adaptation of the large-scale vision-language model CLIP to this task, built on a ViT-L/14 backbone. The proposed approach integrates backbone-aware selective fine-tuning, a lightweight temporal attention pooling mechanism, adapter-based and prompt-conditioned cross-view learning, and an enhanced k-reciprocal re-ranking strategy. Evaluated on the DetReIDX benchmark, the method substantially outperforms existing approaches, achieving mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), for an overall mAP of 35.73.
📝 Abstract
Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results demonstrate that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
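"Backbone-aware selective fine-tuning" typically means freezing most of the large transformer and updating only a small, carefully chosen subset of parameters. The abstract does not specify which subset is tuned, so the sketch below is an illustrative assumption: unfreeze only the last few ViT-L/14 blocks plus normalization layers, using a hypothetical parameter-naming scheme (e.g. `blocks.23.attn.qkv.weight`) rather than the paper's actual configuration.

```python
def select_trainable(param_names, unfreeze_last_n=2, total_blocks=24):
    """Decide which backbone parameters to fine-tune.

    Keeps early transformer blocks frozen for stability and unfreezes only
    the final `unfreeze_last_n` blocks, plus normalization layers, which
    are cheap to adapt. Naming scheme is hypothetical, not the paper's.
    """
    first_unfrozen = total_blocks - unfreeze_last_n
    trainable = set()
    for name in param_names:
        if name.startswith("blocks."):
            # e.g. "blocks.23.attn.proj.weight" -> block index 23
            block_idx = int(name.split(".")[1])
            if block_idx >= first_unfrozen:
                trainable.add(name)
        elif "norm" in name:
            trainable.add(name)
    return trainable
```

In a framework such as PyTorch, one would then set `requires_grad = False` on every parameter whose name is not in the returned set, so the optimizer only updates the selected subset.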
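The temporal attention pooling described above can be pictured as a learned weighted average over a tracklet's per-frame embeddings, where low-quality (blurred, low-resolution) frames receive small weights. The following NumPy sketch assumes per-frame CLIP features and a single learned scoring vector `w`; both the scoring function and the names are illustrative, not the paper's implementation.

```python
import numpy as np

def temporal_attention_pool(frame_feats, w, temperature=1.0):
    """Pool T per-frame features (T x D) into one tracklet feature (D,).

    A scoring vector `w` (D,) assigns each frame a scalar quality score;
    a softmax over time converts scores into weights, so degraded frames
    contribute less to the pooled representation. (Illustrative sketch.)
    """
    scores = frame_feats @ w / temperature           # (T,) per-frame scores
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return weights @ frame_feats                     # weighted temporal average
```

With a zero scoring vector the weights are uniform and the module reduces to plain temporal mean pooling, which makes it a safe drop-in replacement during early training.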