🤖 AI Summary
Text-to-image person search commonly relies on web-crawled image-text pairs for dataset construction, yet these pairs often suffer from severe semantic misalignment noise that significantly degrades retrieval performance. To address this, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework. DURA is the first to model cross-modal similarity evidence as a Dirichlet distribution, explicitly capturing matching uncertainty. It incorporates a Key Feature Selector (KFS) for fine-grained feature selection and introduces a Dynamic Softmax Hinge Loss (DSH-Loss) that jointly optimizes dynamic hard-negative weighting and bidirectional cross-modal alignment. Evaluated on three benchmark datasets, DURA achieves state-of-the-art retrieval accuracy under both low- and high-noise conditions, demonstrating substantial gains in robustness to noisy supervision.
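The Dirichlet-based uncertainty modeling can be sketched as follows. This is a minimal illustration in the style of evidential deep learning, assuming non-negative per-class evidence (e.g. from a ReLU head) over the two matching classes (match / non-match); the helper name `dirichlet_uncertainty` and this exact parameterization are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Turn non-negative evidence into Dirichlet belief masses and vacuity.

    evidence: per-class evidence values (here K=2: match vs. non-match).
    alpha = evidence + 1 parameterises a Dirichlet distribution; the
    Dirichlet strength S = sum(alpha) controls how confident the model is.
    """
    alpha = np.asarray(evidence, dtype=float) + 1.0
    strength = alpha.sum()
    belief = (alpha - 1.0) / strength        # per-class belief mass
    uncertainty = len(alpha) / strength      # vacuity: high when evidence is scarce
    return belief, uncertainty
```

With strong match evidence, e.g. `dirichlet_uncertainty([8.0, 0.0])`, the match belief is high and the uncertainty low; a pair with little evidence in either class yields high uncertainty, which is what lets noisy (mismatched) pairs be down-weighted.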
📝 Abstract
Text-to-image person search aims to identify an individual from a free-form text description. To reduce data-collection costs, large-scale text-image datasets are often built from co-occurring pairs found online. This introduces noise, particularly mismatched pairs, which degrades retrieval performance; existing methods that emphasize hard negative samples tend to amplify this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which comprises a Key Feature Selector (KFS) and a new loss function, the Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability: the bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH-Loss adjusts the weight placed on hard negative samples to improve robustness in noisy environments. Experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.
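A softmax-weighted bidirectional hinge loss in this spirit can be sketched in a few lines of numpy. The function name `dynamic_softmax_hinge`, the temperature `tau`, and the margin value are illustrative assumptions; the paper's exact DSH-Loss may differ, but the sketch shows the core idea: hinge terms over negatives, with softmax weights that shift mass toward harder (more similar) negatives, summed over both retrieval directions.

```python
import numpy as np

def dynamic_softmax_hinge(sim, margin=0.2, tau=10.0):
    """Bidirectional hinge loss with softmax weighting of hard negatives.

    sim: (B, B) image-text similarity matrix; diagonal entries are the
    matched pairs. Both text->image (rows) and image->text (columns,
    via the transpose) directions are optimized.
    """
    B = sim.shape[0]
    pos = np.diag(sim)                          # similarity of matched pairs
    mask = ~np.eye(B, dtype=bool)               # off-diagonal = negatives
    total = 0.0
    for rows in (sim, sim.T):                   # both retrieval directions
        hinge = np.maximum(0.0, margin - pos[:, None] + rows)
        neg = np.where(mask, rows, -np.inf)     # exclude positives from weights
        w = np.exp(tau * neg)
        w /= w.sum(axis=1, keepdims=True)       # softmax: hard negatives dominate
        total += (w * np.where(mask, hinge, 0.0)).sum() / B
    return total
```

When all negatives are already separated by more than the margin, the loss is zero; as a negative's similarity approaches the positive's, both its hinge term and its softmax weight grow, which is the "dynamic" hard-negative emphasis.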