🤖 AI Summary
This work addresses the limited generalization of person re-identification models to unseen domains: stylistically diverse single-camera data aids generalization but lacks the cross-camera viewpoint variation that defines the task. To this end, we propose what is, to our knowledge, the first multimodal joint learning framework for image-based person re-identification trained on a mixture of multi-camera and single-camera data. Our approach integrates multi-camera image data with single-camera image–text pairs and jointly optimizes three objectives: person re-identification, image–text matching, and text-guided image reconstruction. This joint training enriches the semantic representation of the single-camera data and mitigates domain shift. Extensive experiments on multiple cross-domain person re-identification benchmarks show that our method significantly outperforms state-of-the-art approaches.
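The summary does not state how the three objectives are combined; a plausible formulation, assuming a simple weighted sum with balancing weights $\lambda_1$ and $\lambda_2$ (our notation, not the paper's), is

$$\mathcal{L} = \mathcal{L}_{\mathrm{ReID}} + \lambda_1\,\mathcal{L}_{\mathrm{ITM}} + \lambda_2\,\mathcal{L}_{\mathrm{rec}},$$

where $\mathcal{L}_{\mathrm{ReID}}$ is the re-identification loss on multi-camera images, and $\mathcal{L}_{\mathrm{ITM}}$ and $\mathcal{L}_{\mathrm{rec}}$ are the image–text matching and text-guided reconstruction losses on single-camera image–text pairs.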
📝 Abstract
Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While many existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by training on stylistically diverse single-camera data. Although such data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) text-guided image reconstruction, the latter two on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
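As a rough sketch of how such joint training could look in PyTorch, the snippet below combines the three losses in a single step. All module names (`image_encoder`, `text_encoder`, `reid_head`, `decoder`), the contrastive form of the matching loss, the pixel-level reconstruction loss, and the loss weights are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, multi_cam_batch, single_cam_batch,
                        lambda_itm=1.0, lambda_rec=1.0):
    """One hypothetical joint step over both data sources.

    multi_cam_batch:  (images, person_ids) from multi-camera Re-ID data.
    single_cam_batch: (images, text_tokens) image-text pairs from
                      single-camera data.
    Module names and loss weights are illustrative, not the paper's.
    """
    images, pids = multi_cam_batch
    sc_images, texts = single_cam_batch

    # (1) Re-ID on multi-camera data: identity classification loss.
    img_feats = model.image_encoder(images)
    loss_reid = F.cross_entropy(model.reid_head(img_feats), pids)

    # (2) Image-text matching on single-camera pairs: a symmetric
    #     contrastive (InfoNCE-style) loss over the batch.
    v = F.normalize(model.image_encoder(sc_images), dim=-1)
    t = F.normalize(model.text_encoder(texts), dim=-1)
    logits = v @ t.T / 0.07  # temperature is an assumed value
    targets = torch.arange(v.size(0), device=v.device)
    loss_itm = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    # (3) Text-guided reconstruction: decode the image conditioned on
    #     its caption embedding and penalize pixel-level error.
    recon = model.decoder(model.image_encoder(sc_images),
                          model.text_encoder(texts))
    loss_rec = F.mse_loss(recon, sc_images)

    return loss_reid + lambda_itm * loss_itm + lambda_rec * loss_rec
```

Under this reading, each single-camera image would be paired with its textual description so that the matching and reconstruction terms inject cross-modal semantics that the missing camera views cannot provide.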