Chat-Driven Text Generation and Interaction for Person Retrieval

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-based person search (TBPS) suffers from poor scalability due to its heavy reliance on labor-intensive, fine-grained human-annotated text descriptions. To address this, we propose the first fully annotation-free conversational text generation and interaction framework for TBPS. Methodologically, we introduce a novel multi-turn text generation (MTG) mechanism that leverages multimodal large language models (MLLMs) to synthesize high-quality pseudo-labels, coupled with a multi-turn text interaction (MTI) mechanism that dynamically refines ambiguous or incomplete user queries during inference. The framework eliminates dependence on manual text annotations entirely while achieving competitive or even superior retrieval performance on mainstream benchmarks, including CUHK-PEDES and RSTPReid. The core contribution is the integration of conversational generation and interaction paradigms into TBPS, enabling joint optimization of unsupervised pseudo-label construction and query enhancement, which significantly improves practical utility and deployment scalability.

📝 Abstract
Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions, characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
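The MTG idea described above can be sketched in miniature: drive a multimodal LLM through a fixed round of attribute questions about an image, then merge the per-turn answers into one pseudo-caption. Everything here is an assumption for illustration; the paper does not publish its prompts or dialogue policy, so a canned stub (`ask_mllm`) stands in for the real MLLM call.

```python
# Illustrative MTG-style sketch: simulate a multi-turn dialogue over an image
# and merge the answers into a single fine-grained pseudo-label.
# NOTE: `ask_mllm` is a hypothetical stub, not the paper's actual model call.

ATTRIBUTE_QUESTIONS = [
    "What is the person wearing on their upper body?",
    "What is the person wearing on their lower body?",
    "What accessories or items is the person carrying?",
]

def ask_mllm(image_id: str, question: str, history: list) -> str:
    """Stub for a multimodal LLM; a real system would pass the image,
    the question, and the running dialogue history to an MLLM."""
    canned = {
        ATTRIBUTE_QUESTIONS[0]: "a red jacket",
        ATTRIBUTE_QUESTIONS[1]: "dark jeans",
        ATTRIBUTE_QUESTIONS[2]: "a black backpack",
    }
    return canned[question]

def generate_pseudo_label(image_id: str) -> str:
    """Run the simulated multi-turn dialogue, then merge answers."""
    history = []
    for question in ATTRIBUTE_QUESTIONS:
        answer = ask_mllm(image_id, question, history)
        history.append((question, answer))
    # Merge per-turn answers into one caption for the image.
    return "A person wearing " + ", ".join(a for _, a in history) + "."

caption = generate_pseudo_label("img_0001")
print(caption)  # A person wearing a red jacket, dark jeans, a black backpack.
```

The merge step here is a plain join; the actual framework presumably uses the MLLM itself to compose the final description, but the loop structure (question, answer, accumulate, summarize) is the core of the mechanism.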
Problem

Research questions and friction points this paper is trying to address.

Automating text annotation for person retrieval scalability
Resolving vague user queries in real-world search scenarios
Eliminating manual supervision in text-based person search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Turn Text Generation for pseudo-labels
Multi-Turn Text Interaction refines queries
Unified annotation-free framework improves retrieval
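The MTI side can be sketched the same way: at inference, detect which attribute slots a user query leaves unspecified and ask clarifying questions until the description is complete. The slot names and the keyword-based checker below are assumptions for illustration, not the paper's method, which performs this refinement through dialogue-based reasoning.

```python
# Illustrative MTI-style sketch: iteratively refine a vague query by asking
# clarifying questions for missing attribute slots.
# NOTE: the slot list and keyword matching are hypothetical simplifications.

SLOTS = {
    "upper body": ["shirt", "jacket", "coat", "top"],
    "lower body": ["pants", "jeans", "skirt", "shorts"],
}

def missing_slots(query: str) -> list:
    """Return attribute slots the query does not yet mention."""
    q = query.lower()
    return [slot for slot, kws in SLOTS.items()
            if not any(kw in q for kw in kws)]

def refine_query(query: str, answer_fn) -> str:
    """Ask a clarifying question per missing slot and fold in each answer."""
    while missing_slots(query):
        slot = missing_slots(query)[0]
        answer = answer_fn(f"What is the person wearing on the {slot}?")
        query = f"{query} {answer}"
    return query

# Simulated user supplying answers to the clarifying questions.
answers = iter(["wearing a blue jacket", "with black jeans"])
final = refine_query("A tall man", lambda question: next(answers))
print(final)  # A tall man wearing a blue jacket with black jeans
```

In the real framework the clarifying questions would come from an LLM rather than a template, and the refined query feeds the retrieval model; the sketch only shows the interaction loop.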
Authors
Zequn Xie (Zhejiang University)
Chuxin Wang (University of Science and Technology of China) — 3D computer vision and 3D object detection
Sihang Cai (Zhejiang University)
Yeqiang Wang (Northwest A&F University)
Shulei Wang (Zhejiang University) — multimodal learning, computer vision, diffusion models
Tao Jin (Zhejiang University)