Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses "expression drift" in text-to-image person retrieval: the diversity of natural language expressions and the implicit nature of visual semantics make cross-modal alignment less robust. To mitigate this, the authors propose MVR, a training-free semantic compensation framework built on Multi-View Reformulation, in which a large language model generates semantically equivalent yet linguistically diverse variants of each text. In parallel, a vision-language model produces multi-view image descriptions to close the visual semantic gap. The compensation mechanism combines key-feature guidance, diversity-aware rewriting, mean pooling over the multi-view features, and residual connections. On three benchmark datasets, the approach reports state-of-the-art performance and improves the retrieval accuracy of the underlying model without any additional training.

📝 Abstract

In text-to-image person retrieval, the diversity of natural language expressions and the implicitness of visual semantics often lead to Expression Drift: semantically equivalent texts exhibit significant feature discrepancies in the embedding space because of phrasing variations, which degrades the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components. (1) Multi-View Reformulation (MVR): a dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) with diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants. (2) Textual Feature Robustness Enhancement: a training-free latent space compensation mechanism suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes". (3) Visual Semantic Compensation: a VLM generates multi-perspective image descriptions, which are further enhanced through the shared text reformulation to address visual semantic gaps. Experiments demonstrate that the method improves the accuracy of the original model without training and achieves state-of-the-art results on three text-to-image person retrieval datasets.
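The mean-pooling and residual step in the second component can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the function names, the L2 normalization, and the weighting factor `alpha` below are all assumptions chosen for illustration. The idea shown is only that the features of the reformulated variants are averaged and then added back to the original query feature.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("mean_pool needs at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def compensate(
    original: Sequence[float],
    variants: Sequence[Sequence[float]],
    alpha: float = 0.5,  # hypothetical residual weight, not from the paper
) -> Vector:
    """Add the mean-pooled variant features to the original as a residual.

    Assumed form: normalize(original + alpha * mean(variants)).
    With no variants, the normalized original feature is returned.
    """
    base = l2_normalize(original)
    if not variants:
        return base
    pooled = mean_pool([l2_normalize(v) for v in variants])
    if len(pooled) != len(base):
        raise ValueError("variant and original dimensions differ")
    return l2_normalize([b + alpha * p for b, p in zip(base, pooled)])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

In use, `original` would be the text encoder's embedding of the query and `variants` the embeddings of its LLM-generated reformulations; the compensated vector then replaces the original when ranking gallery images by `cosine`. Averaging tends to cancel phrasing-specific noise across variants, while the residual term keeps the result anchored to the original query.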

Problem

Research questions and friction points this paper is trying to address.

Expression Drift

Text-to-Image Person Retrieval

Semantic Compensation

Cross-Modal Alignment

Natural Language Diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-View Reformulation

Semantic Compensation

Expression Drift