Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-Image Person Retrieval (TIPR) suffers from insufficient cross-modal alignment: existing methods over-rely on hard negative mining while neglecting unmatched positive samples, and they lack explicit modeling and verification of local feature alignment. To address this, we propose the Full-Mode Fine-grained Alignment (FMFA) framework, which combines three components: (1) Adaptive Similarity Distribution Matching (A-SDM), which adaptively pulls unmatched positive pairs closer to calibrate global semantic alignment; (2) an Explicit Fine-grained Alignment (EFA) module that sparsifies the cross-modal similarity matrix and applies hard-coded local alignment to strengthen part-level correspondences; and (3) integration with existing implicit relational reasoning, enabling end-to-end joint training without additional supervision. Evaluated on the CUHK-PEDES, RSTPReid, and ICFG-PEDES benchmarks, FMFA significantly outperforms existing global-matching approaches, achieving state-of-the-art performance among global-matching methods.

📝 Abstract
Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning -- hence the term "full-mode" -- without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.
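
To make the global-alignment idea concrete, below is a minimal PyTorch sketch of a similarity-distribution-matching loss with an adaptive re-weighting of unmatched positive pairs. The "unmatched positive" criterion (a positive pair scoring below the hardest negative in its row) and the up-weighting factor are illustrative assumptions, not the paper's exact A-SDM formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_sdm_loss(img_feats, txt_feats, pids, temperature=0.02):
    """img_feats, txt_feats: (B, D) L2-normalized embeddings; pids: (B,) person identity labels."""
    sim = img_feats @ txt_feats.t() / temperature                 # (B, B) image-to-text logits
    labels = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()     # 1 where identities match
    label_dist = labels / labels.sum(dim=1, keepdim=True)         # target matching distribution

    # Assumed "unmatched positive" rule: a positive pair whose similarity falls below
    # the hardest negative in its row gets a larger weight (the factor 2.0 is hypothetical).
    with torch.no_grad():
        neg_only = sim.masked_fill(labels.bool(), float('-inf'))
        hardest_neg = neg_only.max(dim=1, keepdim=True).values
        unmatched = labels.bool() & (sim < hardest_neg)
        weight = torch.where(unmatched, torch.full_like(sim, 2.0), torch.ones_like(sim))

    # Cross-entropy between predicted and target matching distributions, both directions;
    # labels is symmetric, so the same target distribution applies to image-to-text and text-to-image.
    loss_i2t = -(weight * label_dist * F.log_softmax(sim, dim=1)).sum(dim=1).mean()
    loss_t2i = -(weight.t() * label_dist * F.log_softmax(sim.t(), dim=1)).sum(dim=1).mean()
    return loss_i2t + loss_t2i
```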
Problem

Research questions and friction points this paper is trying to address.

Achieving effective cross-modal alignment between text and images
Verifying correct alignment of all local features in retrieval
Addressing incorrectly matched positive pairs during model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit fine-grained alignment without extra supervision
Adaptive similarity distribution matching for unmatched pairs
Sparsified similarity matrix with hard coding for local alignment (see the sketch below)
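
A minimal sketch of what such explicit local alignment could look like, assuming a token-to-patch similarity matrix that is sparsified by top-k selection and a hard one-hot ("hard coding") assignment of each token to its best-matching patch. The function name, the top-k rule, and the loss form are illustrative assumptions rather than the paper's exact EFA module.

```python
import torch
import torch.nn.functional as F

def explicit_local_alignment(patch_feats, token_feats, top_k=3):
    """patch_feats: (P, D) image patch embeddings; token_feats: (T, D) text token embeddings."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    token_feats = F.normalize(token_feats, dim=-1)
    sim = token_feats @ patch_feats.t()                        # (T, P) token-to-patch similarities

    # Sparsify: keep only the top-k most similar patches per token.
    # Non-top-k entries get a large negative value (not -inf, to avoid 0 * inf = NaN below).
    topk_vals, topk_idx = sim.topk(top_k, dim=1)
    sparse_sim = torch.full_like(sim, -1e4).scatter_(1, topk_idx, topk_vals)

    # Hard coding: one-hot assignment of each token to its single best-matching patch,
    # serving as an explicit, verifiable local correspondence target.
    hard_code = F.one_hot(sim.argmax(dim=1), num_classes=sim.size(1)).float()

    # Cross-entropy between the sparsified soft distribution and the hard assignment.
    log_probs = F.log_softmax(sparse_sim, dim=1)
    return -(hard_code * log_probs).sum(dim=1).mean()
```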
Hao Yin
Meta Platforms Inc.
Wireless communication, Optimization, Machine Learning

Xin Man
Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, China

Feiyu Chen
Sichuan Artificial Intelligence Research Institute, China and University of Electronic Science and Technology of China, China

Jie Shao
Professor, University of Electronic Science and Technology of China
Multimedia, Database

Heng Tao Shen
Sichuan Artificial Intelligence Research Institute, Yibin, China and University of Electronic Science and Technology of China, Chengdu, China