🤖 AI Summary
This paper addresses the modality gap and semantic misalignment in video-based visible-infrared person re-identification (VVI-ReID). To this end, the authors propose a video-level language-driven framework, VLD, built on two core modules: (1) invariant-modality language prompting (IMLP), which generates modality-shared, video-level textual prompts by jointly fine-tuning CLIP's visual encoder and a prompt learner; and (2) spatial-temporal prompting (STP), which injects spatiotemporal information into those prompts and enables fine-grained alignment between linguistic and visual features in CLIP's multimodal embedding space. STP itself comprises a spatial-temporal hub (STH), which aggregates and diffuses spatiotemporal information through the [CLS] token of each frame across the ViT layers, and spatial-temporal aggregation (STA), which adds an identity-level loss and specialized multihead attention so the hub focuses on identity-relevant features. Evaluated on two mainstream VVI-ReID benchmarks, VLD achieves state-of-the-art performance, substantially reducing both the semantic gap and the feature shift between the visible and infrared modalities.
📝 Abstract
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both the infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is therefore feasible in principle. However, generating and exploiting modality-shared video-level language prompts to bridge the modality gap remains an open problem. To this end, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP jointly fine-tunes the visual encoder and the prompt learner to generate modality-shared text prompts and align them with visual features from both modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into the text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces a dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.
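The aggregate-and-diffuse behavior the abstract attributes to the STH can be illustrated with a toy sketch: per-frame [CLS] vectors are pooled into a hub vector via attention, and the hub is then blended back into every frame. This is a minimal, dependency-free illustration of the idea only; the hub query, the blending weight `alpha`, and the single-head attention are illustrative assumptions, not the paper's actual ViT-layer implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sth_aggregate_diffuse(cls_tokens, alpha=0.5):
    """Toy spatial-temporal hub step (illustrative, not the paper's code).

    cls_tokens: list of T frame-level [CLS] vectors (lists of floats).
    Aggregation: attention-weighted pooling of the frames into a hub vector.
    Diffusion: blend the hub back into each frame's [CLS] token.
    Returns the list of updated per-frame tokens.
    """
    d = len(cls_tokens[0])
    # Hypothetical choice: use the mean token as the hub's attention query.
    hub_q = [sum(t[i] for t in cls_tokens) / len(cls_tokens) for i in range(d)]
    scores = [dot(hub_q, t) / math.sqrt(d) for t in cls_tokens]
    weights = softmax(scores)
    # Aggregation: attention-weighted sum over frames.
    hub = [sum(w * t[i] for w, t in zip(weights, cls_tokens)) for i in range(d)]
    # Diffusion: convex blend of each frame token with the hub vector.
    return [[(1 - alpha) * t[i] + alpha * hub[i] for i in range(d)]
            for t in cls_tokens]
```

If every frame carries the same [CLS] vector, the hub equals that vector and the diffusion step leaves the tokens unchanged, which is a quick sanity check on the sketch.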