🤖 AI Summary
Witness descriptions in real-world scenarios are often fragmented, ambiguous, and incomplete. To address this, the paper introduces Interactive Person Re-Identification (Inter-ReID), a task in which cross-camera person retrieval is driven by a multi-turn dialogue that progressively refines the textual description of the target. The paper (1) formally defines the Inter-ReID task; (2) constructs the first fine-grained dialogue dataset for person retrieval, covering multiple question types derived from attribute decomposition; (3) proposes a looking-forward supervision strategy that selects the most informative questions as training targets; and (4) designs LLaVA-ReID, a LLaVA-based, multi-image-aware question model that jointly encodes visual features and textual context to generate targeted questions. Experiments show that LLaVA-ReID significantly outperforms baselines on both the proposed Inter-ReID benchmark and standard text-based ReID benchmarks.
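The idea of choosing the most informative question as supervision can be illustrated with a small sketch. This is not the paper's implementation: it assumes that "most informative" is approximated by how much a question's answer improves the rank of the target person, and the names `select_question`, `answer_fn`, and `score_fn` are hypothetical placeholders for a simulated witness and a text-to-image retriever.

```python
from typing import Callable, Optional, Sequence, Tuple


def target_rank(scores: Sequence[float], target_idx: int) -> int:
    """Pessimistic 1-based rank of the target: ties count against it."""
    target_score = scores[target_idx]
    return sum(1 for s in scores if s >= target_score)


def select_question(
    description: str,
    candidates: Sequence[str],
    answer_fn: Callable[[str], str],              # hypothetical simulated witness
    score_fn: Callable[[str], Sequence[float]],   # hypothetical retriever: text -> gallery scores
    target_idx: int,
) -> Tuple[Optional[str], Optional[int]]:
    """Look one step ahead: keep the question whose answer ranks the target best."""
    best_question, best_rank = None, None
    for question in candidates:
        answer = answer_fn(question)
        refined = f"{description} {answer}".strip()  # append the answer to the description
        rank = target_rank(score_fn(refined), target_idx)
        if best_rank is None or rank < best_rank:
            best_question, best_rank = question, rank
    return best_question, best_rank
```

In a training setup along these lines, the selected question would serve as the target output for the question model at that dialogue turn.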
📝 Abstract
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
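The dialogue-based retrieval loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LLaVA-ReID itself: `question_fn` stands in for the question model, `answer_fn` for the witness, and `score_fn` for a text-based retriever, and all three names are hypothetical. Appending each answer to the running description is likewise an assumed, simplified form of refinement.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Dialogue = List[Tuple[str, str]]


def rank_gallery(scores: Sequence[float]) -> List[int]:
    """Gallery indices sorted from best to worst match (stable for ties)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def interactive_retrieval(
    initial_description: str,
    question_fn: Callable[[str, Dialogue, List[int]], Optional[str]],  # hypothetical question model
    answer_fn: Callable[[str], str],                                   # hypothetical witness
    score_fn: Callable[[str], Sequence[float]],                        # hypothetical retriever
    max_rounds: int = 3,
) -> Tuple[str, Dialogue, List[int]]:
    """Refine a partial description over several question-answer rounds."""
    description = initial_description
    dialogue: Dialogue = []
    for _ in range(max_rounds):
        ranking = rank_gallery(score_fn(description))
        # The question model sees the text so far and the current candidates.
        question = question_fn(description, dialogue, ranking)
        if question is None:  # nothing left to ask
            break
        answer = answer_fn(question)
        dialogue.append((question, answer))
        description = f"{description} {answer}".strip()
    return description, dialogue, rank_gallery(score_fn(description))
```

Each round re-ranks the gallery with the refined description, so the question model can condition on the current top candidates as well as on the dialogue history.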