DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

📅 2025-10-27
🏛️ Proceedings of the 33rd ACM International Conference on Multimedia
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient modeling of cross-modal spatiotemporal consistency in Video-based Visible-Infrared Person Re-Identification (VVI-ReID), this paper proposes DinoGRL, a cross-modal sequence matching framework that jointly leverages appearance and gait features. The framework introduces DINOv2 visual priors, reportedly for the first time, to guide gait representation learning. It includes a Semantic-Aware Silhouette and Gait Learning (SASGL) model that extracts robust dynamic silhouette features, and a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module that fuses complementary appearance and gait cues. With gait learning optimized jointly with the ReID objective, the method is reported to significantly outperform state-of-the-art approaches on the HITSZ-VCM and BUPT benchmarks. Ablation studies and qualitative analyses are reported to support its effectiveness in enhancing modality invariance, temporal dynamic modeling, and cross-modal matching robustness.

📝 Abstract
Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Retrieving the same pedestrian across visible and infrared video sequences
Existing methods overlook modality-invariant gait features in cross-modal matching
Learning robust spatiotemporal representations for cross-modal retrieval
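
The retrieval task itself can be illustrated with a minimal sketch. It assumes each video sequence has already been encoded into per-frame feature vectors, and it uses mean pooling over time plus cosine similarity as stand-ins; the paper's actual encoder, pooling, and distance are not described on this page. All function names and the toy data are hypothetical.

```python
import math

def mean_pool(frames):
    """Average per-frame feature vectors into one sequence-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_gallery(query_frames, gallery):
    """Return gallery identities sorted from most to least similar to the query."""
    q = mean_pool(query_frames)
    scores = {pid: cosine(q, mean_pool(frames)) for pid, frames in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (assumed): a 2-frame infrared query and two visible-light tracklets.
query_ir = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]]
gallery_rgb = {
    "person_a": [[1.0, 0.0, 0.1], [0.8, 0.2, 0.1]],
    "person_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}
ranking = rank_gallery(query_ir, gallery_rgb)
print(ranking)
```

In this toy setup the query is closest to `person_a`, so that identity is ranked first. A real system would be judged on how often the correct identity appears near the top of such rankings.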
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages DINOv2 visual priors for gait representation learning
Introduces a Semantic-Aware Silhouette and Gait Learning (SASGL) model
Develops a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module
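
To make the idea of bidirectional, multi-granularity interaction more concrete, here is a rough sketch under strong assumptions. It is not the authors' PBMGE implementation: it has no learned projections, it treats each stream as a list of row-level feature vectors, and the granularities (1, 2, 4 horizontal stripes) and all function names are hypothetical. At each granularity, every row of one stream adds a softmax-weighted mix of the other stream's stripe features, and the refined rows are carried into the next, finer granularity.

```python
import math

def stripe_pool(rows, g):
    """Average H row vectors into g horizontal stripes (H must be divisible by g)."""
    size = len(rows) // g
    return [[sum(col) / size for col in zip(*rows[i * size:(i + 1) * size])]
            for i in range(g)]

def attend(queries, keys):
    """Residual cross-attention: each query adds a softmax-weighted mix of keys."""
    d = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(logits)                      # subtract max for numerical stability
        w = [math.exp(x - m) for x in logits]
        z = sum(w)
        mix = [sum(wi * k[j] for wi, k in zip(w, keys)) / z for j in range(d)]
        out.append([qj + mj for qj, mj in zip(q, mix)])
    return out

def bidirectional_enhance(appearance, gait, granularities=(1, 2, 4)):
    """Refine both streams coarse-to-fine; each step uses the other stream's stripes."""
    for g in granularities:
        a_parts, s_parts = stripe_pool(appearance, g), stripe_pool(gait, g)
        # Both updates read the stripes computed before either stream changes.
        appearance, gait = attend(appearance, s_parts), attend(gait, a_parts)
    return appearance, gait

# Toy inputs (assumed): 4 rows of 2-dimensional features per stream.
appearance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
gait = [[0.2, 0.1], [0.1, 0.3], [0.4, 0.0], [0.0, 0.2]]
refined_app, refined_gait = bidirectional_enhance(appearance, gait)
```

The residual form keeps each stream's own features and only adds information from the other stream, which is one simple way to express complementarity. A trained model would replace the raw dot products with learned query, key, and value projections.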
Yujie Yang
Kunming University of Science and Technology, Faculty of Information Engineering and Automation, Kunming, Yunnan, China
Shuang Li
Chongqing University of Posts and Telecommunications, School of Computer Science and Technology, Chongqing, Chongqing, China
Jun Ye
China University of Mining and Technology, School of Information and Control Engineering, Xuzhou, Jiangsu, China
Neng Dong
Nanjing University of Science and Technology
Fan Li
Kunming University of Science and Technology, Faculty of Information Engineering and Automation, Kunming, Yunnan, China
Huafeng Li
Kunming University of Science and Technology (KUST)
Computer Vision, Pattern Recognition, Machine Learning