Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of unsupervised video-level visible-infrared person re-identification (VI-ReID), where identity labels are unavailable yet temporal information must be leveraged for effective cross-modal matching. To this end, we propose HiTPro, a novel framework that employs a temporal-aware encoder to extract and aggregate frame-level features into intra-camera tracklet prototypes. HiTPro introduces, for the first time in unsupervised video VI-ReID, a prototype-driven paradigm that operates without hard pseudo-labels. It integrates hierarchical cross-prototype alignment, dynamic-threshold soft weight assignment, and a three-tier contrastive learning strategy encompassing intra-camera discriminability, inter-camera intra-modality consistency, and cross-modal invariance. Extensive experiments demonstrate that HiTPro significantly outperforms existing unsupervised methods on the HITSZ-VCM and BUPTCampus benchmarks, establishing a new state-of-the-art and a strong baseline for this task.
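The summary does not spell out how dynamic-threshold soft weight assignment is computed, so the following is only a minimal NumPy sketch of one plausible reading: candidate prototypes whose cosine similarity to a query exceeds a threshold are kept, the threshold changes linearly over training, and the kept candidates receive softmax-normalized weights instead of a hard label. The function name `soft_positive_weights`, the linear schedule, and the parameters `tau_start`, `tau_end`, and `temperature` are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def soft_positive_weights(query, prototypes, epoch, total_epochs,
                          tau_start=0.8, tau_end=0.6, temperature=0.1):
    """Hypothetical soft weighting of candidate positive prototypes.

    query:      (D,) L2-normalized feature or prototype.
    prototypes: (N, D) L2-normalized candidate prototypes.
    Returns (weights, tau), where weights has shape (N,) and is zero
    for candidates below the current threshold.
    """
    # Dot product equals cosine similarity for L2-normalized inputs.
    sims = prototypes @ query

    # Assumed schedule: threshold moves linearly from tau_start to tau_end.
    progress = epoch / max(total_epochs - 1, 1)
    tau = tau_start + (tau_end - tau_start) * progress

    mask = sims >= tau
    weights = np.zeros_like(sims)
    if mask.any():
        # Softmax over the retained candidates only (numerically stabilized).
        logits = sims[mask] / temperature
        logits = logits - logits.max()
        w = np.exp(logits)
        weights[mask] = w / w.sum()
    return weights, tau
```

Under these assumptions, more similar prototypes contribute more to the contrastive objective, while candidates below the threshold contribute nothing, which avoids committing to a single hard pseudo-label.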

📝 Abstract

Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework for unsupervised video-based VI-ReID that requires no explicit hard pseudo-label assignment. HiTPro begins with an efficient Temporal-aware Feature Encoder that extracts discriminative frame-level features and aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping, which aggregates features from temporally partitioned sub-tracklets. Hierarchical Cross-Prototype Alignment then performs a two-stage positive mining process that progresses from within-modality associations to cross-modality matching, enhanced by a Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishing a strong baseline for future research.
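To make the idea of building a prototype from temporally partitioned sub-tracklets more concrete, here is a minimal NumPy sketch. It assumes frame-level features are already extracted, splits them into contiguous temporal segments, mean-pools each segment, and averages the normalized segment features into one prototype. The abstract does not state the aggregation operator or the number of partitions, so the mean pooling, the `num_parts` parameter, and the function names are hypothetical choices for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def tracklet_prototype(frame_feats, num_parts=4):
    """Hypothetical prototype for one tracklet.

    frame_feats: (T, D) array of frame-level features in temporal order.
    num_parts:   assumed number of contiguous temporal sub-tracklets.
    Returns a (D,) L2-normalized prototype vector.
    """
    # Split frames into contiguous temporal segments; drop empty ones,
    # which occur when the tracklet has fewer frames than num_parts.
    parts = [p for p in np.array_split(frame_feats, num_parts, axis=0)
             if len(p) > 0]

    # One feature per sub-tracklet (mean pooling is an assumption here).
    sub_feats = l2_normalize(np.stack([p.mean(axis=0) for p in parts]))

    # Aggregate sub-tracklet features into a single prototype.
    return l2_normalize(sub_feats.mean(axis=0))

# Example usage with random stand-in features (10 frames, 8 dimensions).
rng = np.random.default_rng(0)
prototype = tracklet_prototype(rng.normal(size=(10, 8)))
```

Normalizing each sub-tracklet feature before averaging keeps a single segment with unusually large feature magnitude from dominating the prototype; whether HiTPro does this is not stated in the abstract.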

Problem

Research questions and friction points this paper is trying to address.

unsupervised learning

video-based person re-identification

visible-infrared re-identification

cross-modality matching

tracklet-level representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Prototyping

Hierarchical Alignment

Unsupervised Video-based VI-ReID