🤖 AI Summary
In text-based person retrieval, a significant domain gap exists between synthetic pretraining data and real-world scenarios, manifesting as discrepancies in illumination, color distribution, and viewpoint. To address this, we propose a joint image-level and region-level domain adaptation framework. Methodologically: (1) we introduce a domain-aware diffusion mechanism that performs image-level style transfer, aligning global feature distributions; (2) we design a multi-granularity relation alignment module that models cross-modal semantic associations between fine-grained textual phrases and their corresponding visual regions. This dual-level adaptation strategy effectively bridges the domain shift between pretraining and downstream fine-tuning. Our approach achieves state-of-the-art performance on three major benchmarks (CUHK-PEDES, ICFG-PEDES, and RSTPReid), outperforming existing methods by substantial margins. Comprehensive experiments validate both the effectiveness and generalizability of the proposed joint domain adaptation paradigm.
📝 Abstract
In this work, we focus on text-based person retrieval, which aims to identify individuals based on textual descriptions. Given the significant privacy concerns and the high cost of manual annotation, synthetic data has become a popular choice for pretraining models, leading to notable advancements. However, the considerable domain gap between synthetic pretraining datasets and real-world target datasets, characterized by differences in lighting, color, and viewpoint, remains a critical obstacle that hinders the effectiveness of the pretrain-finetune paradigm. To bridge this gap, we introduce a unified text-based person retrieval pipeline that performs domain adaptation at both the image and region levels. In particular, it contains two primary components, i.e., Domain-aware Diffusion (DaD) for image-level adaptation and Multi-granularity Relation Alignment (MRA) for region-level adaptation. As the name implies, Domain-aware Diffusion migrates the image distribution from the pretraining-dataset domain to the target real-world dataset domain, e.g., CUHK-PEDES. Subsequently, MRA performs a meticulous region-level alignment by establishing correspondences between visual regions and their descriptive sentences, thereby addressing disparities at a finer granularity. Extensive experiments show that our dual-level adaptation method achieves state-of-the-art results on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, outperforming existing methods. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/MRA.
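The abstract only names the components, so as a rough illustration of what region-level relation alignment can look like in practice, below is a minimal PyTorch sketch of a phrase-to-region contrastive objective. Everything here is an illustrative assumption rather than the paper's released implementation: the function name, the max-over-regions aggregation, and the symmetric InfoNCE loss are hypothetical choices; see the linked repository for the authors' actual MRA code.

```python
import torch
import torch.nn.functional as F


def region_phrase_alignment_loss(region_feats: torch.Tensor,
                                 phrase_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment between textual phrases and visual regions (sketch).

    region_feats: (B, R, D) features for R visual regions per image.
    phrase_feats: (B, P, D) features for P textual phrases per caption.
    Caption i is assumed to describe image i (matched pairs on the diagonal).
    """
    region_feats = F.normalize(region_feats, dim=-1)
    phrase_feats = F.normalize(phrase_feats, dim=-1)

    # Pairwise phrase-region similarity for every caption/image pair:
    # sim[i, j, p, r] = cos-sim(phrase p of caption i, region r of image j).
    sim = torch.einsum('ipd,jrd->ijpr', phrase_feats, region_feats)

    # Aggregate to a caption-image score: each phrase keeps its
    # best-matching region, then scores are averaged over phrases.
    scores = sim.max(dim=-1).values.mean(dim=-1)  # (B, B)

    # Symmetric InfoNCE over the batch; matched pairs sit on the diagonal.
    targets = torch.arange(scores.size(0), device=scores.device)
    loss_t2i = F.cross_entropy(scores / temperature, targets)
    loss_i2t = F.cross_entropy(scores.t() / temperature, targets)
    return 0.5 * (loss_t2i + loss_i2t)


# Toy usage with random features (B=4 pairs, R=6 regions, P=3 phrases, D=256).
if __name__ == "__main__":
    loss = region_phrase_alignment_loss(torch.randn(4, 6, 256),
                                        torch.randn(4, 3, 256))
    print(loss.item())
```

The max-over-regions step encodes the intuition that each textual phrase should be grounded in its single best-matching region; other aggregation schemes (soft attention, optimal transport) fit the same slot, and which one MRA uses is only determinable from the repository.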