Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three limitations of video-based visible-infrared person re-identification (VVI-ReID): insufficient spatiotemporal modeling, inadequate cross-modal interaction, and the lack of explicit guidance toward modality-invariant representations. To overcome these challenges, the authors propose Language-driven Sequence-level Modality-invariant Representation Learning (LSMRL), a novel approach built upon the CLIP architecture. LSMRL integrates spatiotemporal feature learning, a semantic diffusion mechanism, and bidirectional cross-modal self-attention, complemented by specially designed modality-level losses that explicitly enhance the discriminability and generalizability of modality-invariant features. Extensive experiments on large-scale VVI-ReID datasets show that LSMRL outperforms state-of-the-art methods.

📝 Abstract
The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving strong performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features, establishing preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over state-of-the-art (SOTA) methods.
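The bidirectional cross-modal attention that the abstract attributes to the CMI module can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the residual fusion, and the use of raw scaled dot-product attention (rather than CLIP's multi-head layers) are all simplifying assumptions. The idea shown is that each modality's sequence features attend to the other modality's features in both directions, so visible and infrared representations are refined against each other.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d):
    # scaled dot-product attention: `query` tokens attend to `key_value` tokens
    scores = query @ key_value.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ key_value

# toy sequence-level features (frames x dim), one set per modality
rng = np.random.default_rng(0)
d = 8
vis = rng.standard_normal((4, d))   # visible-light sequence tokens
ir = rng.standard_normal((4, d))    # infrared sequence tokens

# bidirectional exchange: each modality queries the other, and the
# attended output is fused back through a residual connection
vis_refined = vis + cross_attention(vis, ir, d)
ir_refined = ir + cross_attention(ir, vis, d)
```

In a full model, the refined features would additionally pass through projection layers and feed the modality-level losses; this sketch only captures the two-way attention exchange itself.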
Problem

Research questions and friction points this paper is trying to address.

visible-infrared person re-identification
modal-invariant representation
sequence-level learning
cross-modal interaction
language-driven
Innovation

Methods, ideas, or system contributions that make the work stand out.

modal-invariant representation
language-driven learning
cross-modal interaction
spatial-temporal modeling
visible-infrared re-identification
👥 Authors
Xiaomei Yang
Shandong Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan 250022, China
Xizhan Gao
Shandong Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan 250022, China
Antai Liu
Shandong Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan 250022, China
Kang Wei
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Fa Zhu
Nanjing Forestry University (pattern recognition, machine learning)
Guang Feng
University of Jinan (deep learning, referring image segmentation, saliency detection)
Xiaofeng Qu
Shandong Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan 250022, China
Sijie Niu
University of Jinan (medical image computing, pattern recognition)