TVPR: Text-to-Video Person Retrieval and a New Benchmark

📅 2023-07-14
🏛️ ACM Multimedia
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of text-to-image person retrieval—namely, its reliance on static frames lacking motion cues and susceptibility to occlusion—this paper introduces a novel task: text-to-video person retrieval (TVPR). We present TVPReid, the first large-scale pedestrian video benchmark annotated with fine-grained natural language descriptions. Methodologically, we pioneer the integration of the video modality into text-driven person retrieval and propose Multielement Feature Guided Fragments Learning (MFGF), a strategy that jointly models text-visual and text-motion cross-modal relationships to enable effective cross-modal representation learning and dual latent-space alignment. Evaluated on TVPReid, our approach achieves state-of-the-art performance. The dataset is publicly released, establishing a new benchmark and technical foundation for cross-modal video understanding and retrieval research.
📝 Abstract
Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, because isolated frames provide no dynamic information, performance is hampered when the person is obscured or when variable motion details are missed. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, termed the Text-to-Video Person Re-identification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages cross-modal text-video representations to provide strong text-visual and text-motion matching information, tackling uncertain occlusion conflicts and variable motion details. Specifically, we establish two latent cross-modal spaces for collaborative learning of text and video features, progressively reducing the semantic difference between text and video. To evaluate the effectiveness of the proposed MFGF, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, MFGF is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly released to benefit future research.
Problem

Research questions and friction points this paper is trying to address.

Text-to-image person retrieval relies on isolated frames that lack dynamic information
No existing dataset describes person videos with natural language annotations
Uncertain occlusion and variable motion details hinder retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multielement Feature Guided Fragments Learning strategy
Cross-modal text-video representations for matching
Two cross-modal latent spaces for semantic alignment (see the sketch below)
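
To make the dual latent-space idea concrete, below is a minimal PyTorch-style sketch of how text features could be aligned with appearance and motion features in two separate embedding spaces via symmetric contrastive losses. All names (DualSpaceAlignment, info_nce, the projection heads and their dimensions) are illustrative assumptions, not the paper's actual MFGF implementation.

```python
# A minimal, illustrative sketch of dual cross-modal alignment, assuming
# pre-extracted text, appearance (visual), and motion features.
# NOTE: this is NOT the authors' released code; all module names and
# dimensions below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings a and b."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


class DualSpaceAlignment(nn.Module):
    """Two latent spaces: one for text-visual matching, one for text-motion matching."""

    def __init__(self, text_dim=512, vis_dim=768, mot_dim=768, embed_dim=256):
        super().__init__()
        # Hypothetical projection heads; the backbone encoders are not shown.
        self.text_to_vis = nn.Linear(text_dim, embed_dim)
        self.text_to_mot = nn.Linear(text_dim, embed_dim)
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.mot_proj = nn.Linear(mot_dim, embed_dim)

    def forward(self, text_feat, visual_feat, motion_feat):
        # Align text with frame-level appearance in the first space...
        loss_vis = info_nce(self.text_to_vis(text_feat), self.vis_proj(visual_feat))
        # ...and with clip-level motion in the second space.
        loss_mot = info_nce(self.text_to_mot(text_feat), self.mot_proj(motion_feat))
        return loss_vis + loss_mot
```

In this reading, the two losses jointly pull matching text-video pairs together while pushing apart mismatched pairs in each space, which is one plausible way to realize the text-visual and text-motion matching described in the abstract.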
👥 Authors
Fan Ni (Nanjing Tech University)
Xu Zhang (Nanjing Tech University)
Jianhui Wu (Nanjing Tech University)
Guan-Nan Dong (Nanjing Tech University)
Aichun Zhu (Nanjing Tech University)
Hui Liu (Nanjing Tech University)
Yue Zhang