UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the overfitting and degraded generalization caused by fully fine-tuning CLIP for text-based person retrieval (TPR), this paper proposes a unified parameter-efficient transfer learning (PETL) framework, UP-Person, built on a collaborative architecture of three lightweight modules: Prefix, LoRA, and Adapter. It further introduces S-Prefix to strengthen gradient propagation through prefix tokens and L-Adapter to mitigate interference among the modules, enabling joint optimization of local prompts and global representations. Fine-tuning only 4.7% of the parameters, the approach achieves state-of-the-art performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid, outperforming both full fine-tuning and existing PETL methods. This work is the first systematic integration and enhancement of the three dominant PETL techniques (prefix tuning, LoRA, and adapters) in the TPR domain, balancing knowledge-transfer efficiency with task-specific adaptation.

📝 Abstract
Text-based Person Retrieval (TPR) is a multi-modal task that aims to retrieve a target person from a pool of candidate images given a text description; it has recently garnered considerable attention owing to progress in contrastive vision-language pre-trained models. Prior works leverage pre-trained CLIP to extract visual and textual person features and fully fine-tune the entire network, showing notable performance improvements over uni-modal pre-training models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components: Prefix, LoRA, and Adapter. Prefix and LoRA are devised together to mine local information with task-specific prompts, while Adapter adjusts global feature representations. Additionally, two vanilla submodules are optimized to fit the unified TPR architecture. First, S-Prefix is proposed to boost prefix attention and enhance gradient propagation through prefix tokens, improving the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall feature distribution, resolving conflicts caused by overlap and interaction among multiple submodules. Extensive experiments demonstrate that UP-Person achieves state-of-the-art results on various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
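For readers unfamiliar with the PETL components named above, here is a minimal NumPy sketch of the LoRA idea (a frozen weight plus a trainable low-rank update). The dimensions, rank, scaling, and initialization below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size and low rank (illustrative)
W = rng.standard_normal((d, d))    # frozen pre-trained weight

# LoRA trains only two small matrices A (r x d) and B (d x r);
# the effective weight becomes W + (alpha / r) * B @ A.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))               # zero-init so training starts exactly at W
alpha = 4.0

def lora_forward(x):
    # x: (batch, d) -> frozen path plus low-rank correction
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((3, d))
# With B zero-initialized, the LoRA layer matches the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Note the parameter saving: the trainable matrices hold 2·r·d values versus d² for full fine-tuning of W, which is the kind of budget reduction that lets UP-Person update only 4.7% of CLIP's parameters.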
Problem

Research questions and friction points this paper is trying to address.

Efficient transfer learning for text-based person retrieval
Overcoming overfitting in large model fine-tuning
Enhancing multi-modal knowledge transfer from CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified PETL method for text-based person retrieval
Integrates Prefix, LoRA, and Adapter for efficiency
Optimizes S-Prefix and L-Adapter for better performance
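The L-Adapter described in the abstract, an adapter placed in parallel with layer normalization rather than after a sub-layer, can be sketched roughly as follows. The bottleneck size, ReLU activation, and zero initialization are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain layer normalization over the last dimension (no learned scale/shift).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d, bottleneck = 8, 2               # hidden size and adapter bottleneck (illustrative)
W_down = rng.standard_normal((d, bottleneck)) * 0.01
W_up = np.zeros((bottleneck, d))   # zero-init: block starts as plain LayerNorm

def l_adapter_block(x):
    # Adapter branch runs in PARALLEL with the normalization, so it can
    # shift the overall feature distribution instead of post-processing it.
    return layer_norm(x) + np.maximum(x @ W_down, 0.0) @ W_up

x = rng.standard_normal((3, d))
# With W_up zero-initialized, the block reduces to vanilla LayerNorm.
assert np.allclose(l_adapter_block(x), layer_norm(x))
```

The parallel placement (versus the usual sequential adapter) is what the authors credit with avoiding conflicts when Prefix, LoRA, and Adapter coexist in the same transformer layer.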
Yating Liu
Shenzhen International Graduate School, Tsinghua University, Shenzhen 518071, China and Peng Cheng Laboratory, Shenzhen 518071, China
Yaowei Li
Peking University
Computer Vision · Generative Models · 3D Vision · Multi-modal Processing
Xiangyuan Lan
Pengcheng Laboratory
Multimodal LLM · Place Recognition · Visual Tracking · Person Re-identification · Object Detection
Wenming Yang
Tsinghua University
Computer Vision · Image Processing
Zimo Liu
pcl.ac.cn
computer vision · deep learning · machine learning · artificial intelligence
Qingmin Liao
Shenzhen International Graduate School, Tsinghua University, Shenzhen 518071, China