🤖 AI Summary
This work addresses the challenge of unsupervised cross-scenario person re-identification (e.g., cross-resolution and clothing-changing settings) by proposing an Image-Text Knowledge Modeling (ITKM) framework. Built upon CLIP, ITKM integrates scene embeddings, a multi-scenario separation loss, and a dynamic text representation update mechanism into a three-stage pipeline, enabling unified handling of diverse heterogeneous scenarios without requiring labeled data. As the first approach to unify multiple cross-scenario re-identification tasks within a single unsupervised framework, ITKM achieves both stronger generalization and consistent performance gains over existing methods across various cross-scenario benchmarks.
📝 Abstract
We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start from a pre-trained CLIP model comprising an image encoder and a text encoder. In Stage I, we introduce a scenario embedding into the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with the pseudo-labels from Stage I and introduce a multi-scenario separation loss that increases the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. We then propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experiments across diverse scenarios demonstrate the superiority and generalizability of ITKM: it not only outperforms existing scenario-specific methods but also improves overall performance by integrating knowledge across scenarios.
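The abstract does not give the exact form of the Stage II multi-scenario separation loss, so the following is only a minimal NumPy sketch of one plausible realization: penalizing the cosine similarity between per-scenario mean text embeddings so that inter-scenario text representations are pushed apart. The function name and the specific averaging/normalization choices are assumptions, not the paper's actual formulation.

```python
import numpy as np

def multi_scenario_separation_loss(text_embs, scenario_ids):
    """Hypothetical sketch of a multi-scenario separation loss.

    text_embs    : (N, D) array of text embeddings
    scenario_ids : length-N list/array of scenario labels

    Returns the mean pairwise cosine similarity between per-scenario
    mean embeddings; minimizing it increases inter-scenario divergence.
    """
    ids = np.asarray(scenario_ids)
    # L2-normalized mean embedding (center) per scenario
    centers = []
    for s in sorted(set(scenario_ids)):
        m = text_embs[ids == s].mean(axis=0)
        centers.append(m / np.linalg.norm(m))
    # average cosine similarity over all scenario pairs
    loss, pairs = 0.0, 0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            loss += float(centers[i] @ centers[j])
            pairs += 1
    return loss / max(pairs, 1)
```

Under this sketch, orthogonal scenario centers yield a loss near 0 while collapsed (identical) centers yield a loss near 1, so gradient descent on it would separate the scenarios' text representations.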