Word4Per: Zero-shot Composed Person Retrieval

📅 2023-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing person retrieval methods are limited to unimodal queries (image-only or text-only), failing to meet diverse real-world demands. This paper introduces Zero-shot Composed Person Retrieval (ZS-CPR), a cross-modal task that jointly leverages visual and textual cues to retrieve target individuals without requiring manually annotated image–text query pairs. To address it, the authors propose Word4Per, a two-stage framework: (1) a lightweight Textual Inversion Network (TINet) maps a reference image to a pseudo-word token in the text embedding space; (2) a text-based person retrieval model built on fine-tuned CLIP performs cross-modal matching. They also construct ITCPR, a finely annotated benchmark for composed person retrieval. Extensive experiments show that Word4Per surpasses comparative methods by over 10% in both Rank-1 and mAP. The code and dataset are publicly released.
📝 Abstract
Searching for a specific person has great social benefit and security value, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fall short of effectively harnessing both types of information, leading to a loss of accuracy. In this paper, a new task called Composed Person Retrieval (CPR) is proposed to jointly utilize image and text information for target person retrieval. However, supervised CPR requires a very costly, manually annotated dataset, and no such resources are currently available. To mitigate this issue, we first introduce Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to resolve the CPR problem without expensive annotations. Second, to learn a ZS-CPR model, we propose a two-stage learning framework, Word4Per, in which a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model built on a fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without using any CPR data. Third, a finely annotated Image-Text Composed Person Retrieval (ITCPR) dataset is built as the benchmark to assess the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.
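The two-stage idea described in the abstract — invert the reference image into a pseudo-word token, append it to the relative caption, and rank gallery images by similarity in a shared embedding space — can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the random projections `W_img` and `W_inv` and the mean-pool text encoder are hypothetical stand-ins for CLIP's frozen encoders and the learned TINet.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding dimension (illustrative)

# Frozen stand-ins for CLIP's encoders and the TINet (hypothetical weights).
W_img = rng.standard_normal((128, D)) / np.sqrt(128)  # "image encoder"
W_inv = rng.standard_normal((D, D)) / np.sqrt(D)      # "TINet" inversion map

def encode_image(x):
    """Placeholder image encoder: project and L2-normalise."""
    v = x @ W_img
    return v / np.linalg.norm(v)

def invert_to_pseudo_token(img_emb):
    """Stage 1: map an image embedding to a pseudo-word token embedding."""
    return np.tanh(img_emb @ W_inv)

def encode_text(token_embs):
    """Placeholder text encoder: mean-pool token embeddings, L2-normalise."""
    v = token_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def composed_query(ref_img, caption_tokens):
    """Stage 2 query: caption tokens plus the inverted pseudo token."""
    pseudo = invert_to_pseudo_token(encode_image(ref_img))
    return encode_text(np.vstack([caption_tokens, pseudo]))

# Toy gallery of 5 "person images" and one composed query.
gallery = rng.standard_normal((5, 128))
gallery_embs = np.stack([encode_image(g) for g in gallery])
caption = rng.standard_normal((4, D))  # tokens for e.g. "wearing a red coat"
q = composed_query(gallery[2], caption)

scores = gallery_embs @ q          # cosine similarity (all unit-norm)
ranking = np.argsort(-scores)      # retrieval order, best match first
```

In the actual framework, only the TINet is trained (on image–caption data, not CPR triplets), which is what makes the composed retrieval zero-shot.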
Problem

Research questions and friction points this paper is trying to address.

Lack of annotated datasets for Composed Person Retrieval (CPR).
Need for better representation of composed person queries.
Requirement for objective evaluation of retrieval methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot formulation (ZS-CPR) that removes the need for annotated CPR query pairs
Two-stage Word4Per framework: a lightweight Textual Inversion Network (TINet) plus a fine-tuned CLIP-based retrieval model
ITCPR, a finely annotated benchmark for composed person retrieval
Delong Liu
Beijing University of Posts and Telecommunications, Beijing, China
Haiwen Li
Beijing University of Posts and Telecommunications, Beijing, China
Zhicheng Zhao
Associate Professor at the School of Artificial Intelligence, Anhui University
Computer Vision
Fei Su
Beijing University of Posts and Telecommunications, Beijing Key Laboratory of Network System and Network Culture, Beijing, China
Yuan Dong
Fudan University; Alibaba
Computer Vision · Medical Image Computing · Machine Learning