Graph-Based Cross-Domain Knowledge Distillation for Cross-Dataset Text-to-Image Person Retrieval

📅 2025-01-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses cross-dataset text-to-image person retrieval under unsupervised domain adaptation, where no labels are available in the target domain. Method: We propose a graph-structured guided cross-domain knowledge distillation framework. It constructs a cross-domain heterogeneous graph to model inter-modal (text–image) and inter-domain relationships, designs a graph-driven multimodal message propagation mechanism, and introduces a contrastive momentum-based knowledge distillation module for end-to-end adaptation without target-domain annotations. Contribution/Results: By innovatively integrating graph neural networks, online knowledge distillation, and momentum-updated feature queues, our method achieves significant improvements over state-of-the-art approaches on three public benchmarks. Extensive experiments demonstrate its robustness and efficiency in unsupervised cross-domain cross-modal retrieval.
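The summary's "graph-driven multimodal message propagation" can be illustrated with a toy round of mean-aggregation message passing over a small cross-domain graph. This is a hedged sketch only: the node layout, the adjacency, and the mean-aggregate-then-ReLU rule are generic GNN conventions assumed for illustration, not the paper's actual propagation mechanism.

```python
import numpy as np

def propagate(H, A, W):
    """One illustrative message-passing round: average neighbor features,
    project with W, apply ReLU.

    H: (n, d) node features; A: (n, n) adjacency (1 = edge); W: (d, d') weights.
    """
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero for isolated nodes
    return np.maximum((A / deg) @ H @ W, 0)  # mean aggregation, projection, ReLU

# Toy heterogeneous graph: nodes 0-1 are source-domain images, nodes 2-3 are
# target-domain texts; edges link cross-modal / cross-domain neighbors.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 8))
W = np.eye(8)                                # identity projection for the demo
H_new = propagate(H, A, W)
```

With the identity projection, each updated node is simply the ReLU of its neighbors' mean feature, so node 3 (whose only neighbor is node 0) receives exactly `relu(H[0])`.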

Technology Category

Application Category

πŸ“ Abstract
Video surveillance systems are crucial components for ensuring public safety and management in smart city. As a fundamental task in video surveillance, text-to-image person retrieval aims to retrieve the target person from an image gallery that best matches the given text description. Most existing text-to-image person retrieval methods are trained in a supervised manner that requires sufficient labeled data in the target domain. However, it is common in practice that only unlabeled data is available in the target domain due to the difficulty and cost of data annotation, which limits the generalization of existing methods in practical application scenarios. To address this issue, we propose a novel unsupervised domain adaptation method, termed Graph-Based Cross-Domain Knowledge Distillation (GCKD), to learn the cross-modal feature representation for text-to-image person retrieval in a cross-dataset scenario. The proposed GCKD method consists of two main components. Firstly, a graph-based multi-modal propagation module is designed to bridge the cross-domain correlation among the visual and textual samples. Secondly, a contrastive momentum knowledge distillation module is proposed to learn the cross-modal feature representation using the online knowledge distillation strategy. By jointly optimizing the two modules, the proposed method is able to achieve efficient performance for cross-dataset text-to-image person retrieval. acExtensive experiments on three publicly available text-to-image person retrieval datasets demonstrate the effectiveness of the proposed GCKD method, which consistently outperforms the state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

Video Surveillance
Text-to-Image Retrieval
Unlabeled Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

GCKD
Unsupervised Learning
Cross-Dataset Person Matching
🔎 Similar Papers
No similar papers found.
Bingjun Luo
BNRist, KLISS, and School of Software, Tsinghua University
Jinpeng Wang
BNRist, KLISS, and School of Software, Tsinghua University
Wang Zewen
BNRist, KLISS, and School of Software, Tsinghua University
Junjie Zhu
Shanghai Jiao Tong University
Xibin Zhao
BNRist, KLISS, and School of Software, Tsinghua University