🤖 AI Summary
This work addresses key limitations in video visible-infrared person re-identification (VVI-ReID): insufficient spatiotemporal modeling, inadequate cross-modal interaction, and the lack of explicit guidance toward modality-invariant representations. To overcome these challenges, we propose Language-driven Sequence-level Modality-invariant Representation Learning (LSMRL), a novel approach built upon the CLIP architecture. LSMRL integrates spatiotemporal feature learning, a semantic diffusion mechanism, and bidirectional cross-modal self-attention, complemented by a specially designed modality-level loss function that explicitly enhances the discriminability and generalizability of modality-invariant features. Extensive experiments on large-scale VVI-ReID datasets demonstrate that LSMRL significantly outperforms state-of-the-art methods, confirming its effectiveness and technical advancement.
📝 Abstract
The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modality-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modality-invariant representations. Despite achieving strong performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modality-invariant representation learning (LSMRL) method, which comprises a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modality-invariant features, the SD module diffuses modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module then leverages bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modality-invariant representations. To explicitly enhance the learning of modality-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over state-of-the-art (SOTA) methods.
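The abstract does not give implementation details for the CMI module; as a rough illustration only, bidirectional cross-modal attention between visible and infrared sequence features might resemble the following NumPy sketch. All names (`cross_attend`, `vis`, `ir`), the feature dimension, the residual update, and the random inputs are hypothetical assumptions, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context, d):
    # Scaled dot-product attention: query tokens attend to context tokens.
    scores = query @ context.T / np.sqrt(d)   # (n_q, n_ctx) similarity scores
    return softmax(scores, axis=-1) @ context # weighted sum of context tokens

rng = np.random.default_rng(0)
d = 64                               # feature dimension (assumed)
vis = rng.standard_normal((8, d))    # visible-sequence frame features
ir = rng.standard_normal((8, d))     # infrared-sequence frame features

# Bidirectional interaction: each modality attends to the other, and a
# residual connection folds the attended features back in, which is one
# common way to narrow the gap between the two representations.
vis_refined = vis + cross_attend(vis, ir, d)
ir_refined = ir + cross_attend(ir, vis, d)
```

In practice such a module would operate on learned projections (queries, keys, values) inside a transformer block rather than raw features; the sketch only conveys the attend-in-both-directions idea.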