TVPR: Text-to-Video Person Retrieval and a New Benchmark

📅 2023-07-14
🏛️ ACM Multimedia
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of text-to-image person retrieval—namely, its reliance on static frames lacking motion cues and susceptibility to occlusion—this paper introduces a novel task: text-to-video person retrieval (TVPR). We present TVPReid, the first large-scale pedestrian video benchmark annotated with fine-grained natural language descriptions. Methodologically, we pioneer the integration of the video modality into text-driven person retrieval and propose Multielement Feature Guided Fragments Learning (MFGF), a strategy that jointly models text-visual and text-motion cross-modal relationships to enable effective cross-modal representation learning and dual latent-space alignment. Evaluated on TVPReid, our approach achieves state-of-the-art performance. The dataset is publicly released, establishing a new benchmark and technical foundation for cross-modal video understanding and retrieval research.
📝 Abstract
Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, because isolated frames provide no dynamic information, performance is hampered when the person is obscured or when variable motion details are missed. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, termed the Text-to-Video Person Re-identification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages cross-modal text-video representations to provide strong text-visual and text-motion matching information, tackling uncertain occlusion conflicts and variable motion details. Specifically, we establish two latent cross-modal spaces for collaborative learning of text and video features, progressively reducing the semantic difference between text and video. To evaluate the effectiveness of the proposed MFGF, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, MFGF is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly released to benefit future research.
Problem

Research questions and friction points this paper is trying to address.

Text-to-image person retrieval relies on isolated frames that lack dynamic information
No existing dataset describes person videos with natural language annotations
Uncertain occlusion and variable motion details hinder retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multielement Feature Guided Fragments Learning strategy
Cross-modal text-video representations for matching
Two cross-modal latent spaces for semantic alignment (see the sketch below)
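
To make the dual latent-space idea concrete, below is a minimal PyTorch-style sketch of how text features could be aligned with appearance and motion features in two separate embedding spaces via symmetric contrastive losses. All names (DualSpaceAlignment, info_nce, the projection heads and their dimensions) are illustrative assumptions, not the paper's actual MFGF implementation.

```python
# A minimal, illustrative sketch of dual cross-modal alignment, assuming
# pre-extracted text, appearance (visual), and motion features.
# NOTE: this is NOT the authors' released code; all module names and
# dimensions below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings a and b."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


class DualSpaceAlignment(nn.Module):
    """Two latent spaces: one for text-visual matching, one for text-motion matching."""

    def __init__(self, text_dim=512, vis_dim=768, mot_dim=768, embed_dim=256):
        super().__init__()
        # Hypothetical projection heads; the backbone encoders are not shown.
        self.text_to_vis = nn.Linear(text_dim, embed_dim)
        self.text_to_mot = nn.Linear(text_dim, embed_dim)
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.mot_proj = nn.Linear(mot_dim, embed_dim)

    def forward(self, text_feat, visual_feat, motion_feat):
        # Align text with frame-level appearance in the first space...
        loss_vis = info_nce(self.text_to_vis(text_feat), self.vis_proj(visual_feat))
        # ...and with clip-level motion in the second space.
        loss_mot = info_nce(self.text_to_mot(text_feat), self.mot_proj(motion_feat))
        return loss_vis + loss_mot
```

In this reading, the two losses jointly pull matching text-video pairs together while pushing apart mismatched pairs in each space, which is one plausible way to realize the text-visual and text-motion matching described in the abstract.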
👥 Authors
Fan Ni (Nanjing Tech University)
Xu Zhang (Nanjing Tech University)
Jianhui Wu (Nanjing Tech University)
Guan-Nan Dong (Nanjing Tech University)
Aichun Zhu (Nanjing Tech University)
Hui Liu (Nanjing Tech University)
Yue Zhang