🤖 AI Summary
This paper addresses the modality gap and semantic misalignment in video-based visible-infrared person re-identification (VVI-ReID). To this end, the authors propose a video-level language-driven framework, VLD, built on two core modules: (1) invariant-modality language prompting (IMLP), which generates modality-shared, video-level textual prompts by jointly fine-tuning CLIP's visual encoder and a prompt learner; and (2) spatial-temporal prompting (STP), which injects spatiotemporal information into those prompts and enables fine-grained alignment between linguistic and visual features in CLIP's multimodal embedding space. STP itself comprises a spatial-temporal hub (STH), which aggregates and diffuses spatiotemporal information through the [CLS] token of each frame across the ViT layers, and spatial-temporal aggregation (STA), which adds an identity-level loss and specialized multihead attention so the hub focuses on identity-relevant features. Evaluated on two mainstream VVI-ReID benchmarks, VLD achieves state-of-the-art performance, substantially reducing both the semantic gap and the feature shift between the visible and infrared modalities.
📝 Abstract
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both the infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is therefore feasible in principle. However, generating and exploiting modality-shared video-level language prompts to bridge the modality gap remains an open problem. To this end, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP jointly fine-tunes the visual encoder and the prompt learner to generate modality-shared text prompts and align them with visual features from both modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into the text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces a dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.
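The aggregate-and-diffuse behavior the abstract attributes to the STH can be illustrated with a toy sketch: per-frame [CLS] vectors are pooled into a hub vector via attention, and the hub is then blended back into every frame. This is a minimal, dependency-free illustration of the idea only; the hub query, the blending weight `alpha`, and the single-head attention are illustrative assumptions, not the paper's actual ViT-layer implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sth_aggregate_diffuse(cls_tokens, alpha=0.5):
    """Toy spatial-temporal hub step (illustrative, not the paper's code).

    cls_tokens: list of T frame-level [CLS] vectors (lists of floats).
    Aggregation: attention-weighted pooling of the frames into a hub vector.
    Diffusion: blend the hub back into each frame's [CLS] token.
    Returns the list of updated per-frame tokens.
    """
    d = len(cls_tokens[0])
    # Hypothetical choice: use the mean token as the hub's attention query.
    hub_q = [sum(t[i] for t in cls_tokens) / len(cls_tokens) for i in range(d)]
    scores = [dot(hub_q, t) / math.sqrt(d) for t in cls_tokens]
    weights = softmax(scores)
    # Aggregation: attention-weighted sum over frames.
    hub = [sum(w * t[i] for w, t in zip(weights, cls_tokens)) for i in range(d)]
    # Diffusion: convex blend of each frame token with the hub vector.
    return [[(1 - alpha) * t[i] + alpha * hub[i] for i in range(d)]
            for t in cls_tokens]
```

If every frame carries the same [CLS] vector, the hub equals that vector and the diffusion step leaves the tokens unchanged, which is a quick sanity check on the sketch.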