Chat-Driven Text Generation and Interaction for Person Retrieval

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-based person search (TBPS) suffers from poor scalability due to its heavy reliance on labor-intensive, fine-grained human-annotated text descriptions. To address this, we propose the first fully annotation-free conversational text generation and interaction framework for TBPS. Methodologically, we introduce a novel multi-turn text generation (MTG) mechanism that leverages multimodal large language models (MLLMs) to synthesize high-quality pseudo-labels, coupled with a multi-turn text interaction (MTI) mechanism that dynamically refines ambiguous or incomplete user queries during inference. The framework eliminates dependence on manual text annotations entirely while achieving competitive or even superior retrieval performance on mainstream benchmarks, including CUHK-PEDES and RSTPReid. The core contribution is the integration of conversational generation and interaction paradigms into TBPS, enabling joint optimization of unsupervised pseudo-label construction and query enhancement, which significantly improves practical utility and deployment scalability.

📝 Abstract
Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions, characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
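The MTG idea described above can be sketched in miniature: drive a multimodal LLM through a fixed round of attribute questions about an image, then merge the per-turn answers into one pseudo-caption. Everything here is an assumption for illustration; the paper does not publish its prompts or dialogue policy, so a canned stub (`ask_mllm`) stands in for the real MLLM call.

```python
# Illustrative MTG-style sketch: simulate a multi-turn dialogue over an image
# and merge the answers into a single fine-grained pseudo-label.
# NOTE: `ask_mllm` is a hypothetical stub, not the paper's actual model call.

ATTRIBUTE_QUESTIONS = [
    "What is the person wearing on their upper body?",
    "What is the person wearing on their lower body?",
    "What accessories or items is the person carrying?",
]

def ask_mllm(image_id: str, question: str, history: list) -> str:
    """Stub for a multimodal LLM; a real system would pass the image,
    the question, and the running dialogue history to an MLLM."""
    canned = {
        ATTRIBUTE_QUESTIONS[0]: "a red jacket",
        ATTRIBUTE_QUESTIONS[1]: "dark jeans",
        ATTRIBUTE_QUESTIONS[2]: "a black backpack",
    }
    return canned[question]

def generate_pseudo_label(image_id: str) -> str:
    """Run the simulated multi-turn dialogue, then merge answers."""
    history = []
    for question in ATTRIBUTE_QUESTIONS:
        answer = ask_mllm(image_id, question, history)
        history.append((question, answer))
    # Merge per-turn answers into one caption for the image.
    return "A person wearing " + ", ".join(a for _, a in history) + "."

caption = generate_pseudo_label("img_0001")
print(caption)  # A person wearing a red jacket, dark jeans, a black backpack.
```

The merge step here is a plain join; the actual framework presumably uses the MLLM itself to compose the final description, but the loop structure (question, answer, accumulate, summarize) is the core of the mechanism.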
Problem

Research questions and friction points this paper is trying to address.

Automating text annotation for person retrieval scalability
Resolving vague user queries in real-world search scenarios
Eliminating manual supervision in text-based person search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Turn Text Generation for pseudo-labels
Multi-Turn Text Interaction refines queries
Unified annotation-free framework improves retrieval
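The MTI side can be sketched the same way: at inference, detect which attribute slots a user query leaves unspecified and ask clarifying questions until the description is complete. The slot names and the keyword-based checker below are assumptions for illustration, not the paper's method, which performs this refinement through dialogue-based reasoning.

```python
# Illustrative MTI-style sketch: iteratively refine a vague query by asking
# clarifying questions for missing attribute slots.
# NOTE: the slot list and keyword matching are hypothetical simplifications.

SLOTS = {
    "upper body": ["shirt", "jacket", "coat", "top"],
    "lower body": ["pants", "jeans", "skirt", "shorts"],
}

def missing_slots(query: str) -> list:
    """Return attribute slots the query does not yet mention."""
    q = query.lower()
    return [slot for slot, kws in SLOTS.items()
            if not any(kw in q for kw in kws)]

def refine_query(query: str, answer_fn) -> str:
    """Ask a clarifying question per missing slot and fold in each answer."""
    while missing_slots(query):
        slot = missing_slots(query)[0]
        answer = answer_fn(f"What is the person wearing on the {slot}?")
        query = f"{query} {answer}"
    return query

# Simulated user supplying answers to the clarifying questions.
answers = iter(["wearing a blue jacket", "with black jeans"])
final = refine_query("A tall man", lambda question: next(answers))
print(final)  # A tall man wearing a blue jacket with black jeans
```

In the real framework the clarifying questions would come from an LLM rather than a template, and the refined query feeds the retrieval model; the sketch only shows the interaction loop.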
Authors
Zequn Xie (Zhejiang University)
Chuxin Wang (University of Science and Technology of China) — 3D computer vision and 3D object detection
Sihang Cai (Zhejiang University)
Yeqiang Wang (Northwest A&F University)
Shulei Wang (Zhejiang University) — multimodal learning, computer vision, diffusion models
Tao Jin (Zhejiang University)