Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
In text-based person retrieval, CLIP-based methods suffer from weak fine-grained feature learning and high sensitivity to textual noise, primarily due to the lack of person-centric training data and the limitations of global contrastive learning. To address these issues, this work proposes a co-optimization paradigm spanning both data and model. First, leveraging the in-context learning ability of multimodal large language models (MLLMs), the authors design a denoising data curation pipeline and release WebPerson, a large-scale person-centric dataset. Second, they introduce GA-DMS, a gradient-attention-guided dual-masking framework that adaptively masks noisy text tokens and jointly optimizes masked token prediction with image-text contrastive learning. Extensive experiments demonstrate state-of-the-art performance on benchmarks including CUHK-PEDES and RSTPReid, with significant gains in retrieval accuracy and robustness to textual noise.

📝 Abstract
Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of large-scale annotated person-centric vision-language data
Overcomes limitations of global contrastive learning for fine-grained matching
Reduces vulnerability to noisy text tokens in cross-modal retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated MLLM-based data curation pipeline
Gradient-attention guided adaptive text masking
Masked token prediction for fine-grained learning
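The core idea behind the second innovation can be sketched as follows: score each text token by combining its attention weight with the gradient magnitude of the image-text similarity with respect to its embedding, then mask the lowest-scoring (noisy) tokens. This is an illustrative numpy sketch only; the function name, the exact scoring formula, and the masking policy are assumptions, not the published GA-DMS implementation.

```python
import numpy as np

def gradient_attention_mask(attn, grads, mask_ratio=0.3):
    """Hypothetical sketch of gradient-attention guided token masking.

    attn:  (T,) attention weights of the T text tokens w.r.t. the global query
    grads: (T, D) gradients of the image-text similarity w.r.t. token embeddings

    Tokens with the LOWEST gradient-attention score are treated as noisy and
    masked; high-scoring (informative) tokens remain visible. The real GA-DMS
    scoring and masking policy may differ.
    """
    # per-token gradient magnitude
    grad_norm = np.linalg.norm(grads, axis=1)
    # gradient-attention similarity score: high = informative, low = noisy
    score = attn * grad_norm
    # adaptively mask the mask_ratio fraction of lowest-scoring tokens
    k = max(1, int(len(score) * mask_ratio))
    noisy_idx = np.argsort(score)[:k]
    mask = np.ones(len(score), dtype=bool)
    mask[noisy_idx] = False  # False = masked out as noisy
    return mask, score

# toy usage: 5 tokens, token 3 and 4 carry little attention
attn = np.array([0.5, 0.1, 0.3, 0.05, 0.05])
grads = np.ones((5, 4))  # uniform gradients for illustration
mask, score = gradient_attention_mask(attn, grads, mask_ratio=0.4)
```

In this toy run the two lowest-scoring tokens (indices 3 and 4) are masked while the rest stay visible. The complementary masked-token-prediction objective would then ask the model to reconstruct held-out informative tokens, encouraging fine-grained semantic representations.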
Tianlu Zheng
Northeastern University
Yifan Zhang
South China University of Technology
Xiang An
DeepGlint
Computer Vision
Ziyong Feng
DeepGlint
Kaicheng Yang
DeepGlint
Multimodal, CV, NLP
Qichuan Ding
Northeastern University