InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of interpretable semantic alignment in existing text-to-image person re-identification (TI-ReID) methods, which do not reliably bind visual regions to the part-level concepts named in a language description. The authors propose InterPartAbility, an interpretable TI-ReID method that performs open-vocabulary, part-level matching between image regions and textual phrases. Built on the CLIP ViT architecture, it adds a lightweight Patch-Phrase Interaction Module (PPIM) that supplies concept-level supervision, and it constrains self-attention so that patch activations concentrate spatially around each part-level phrase, yielding grounded explanation maps. The study also introduces a quantitative interpretability protocol for TI-ReID, adapted from perturbation-based metrics, which includes counterfactual region masking: retrieval degradation is measured after the top-ranked explanatory regions are removed. On the CUHK-PEDES and ICFG-PEDES benchmarks, InterPartAbility achieves state-of-the-art results on these interpretability metrics while sustaining competitive retrieval accuracy.
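
The idea of an explanation map that aligns image patches with textual phrases can be illustrated with a minimal sketch. This is not the authors' PPIM: it only assumes that patch and phrase embeddings live in a shared space (as CLIP-style encoders provide) and scores each patch against each phrase by cosine similarity. The function names, the 24×8 patch grid, the embedding size, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def explanation_maps(patch_emb, phrase_emb, grid_hw, temperature=0.07):
    """Return one spatial map per phrase.

    patch_emb:  (P, D) patch embeddings, P = H * W (assumed layout: row-major grid)
    phrase_emb: (K, D) phrase embeddings
    grid_hw:    (H, W) patch grid shape
    Output:     (K, H, W); each map is a softmax over patches and sums to 1.
    """
    h, w = grid_hw
    assert patch_emb.shape[0] == h * w, "patch count must match the grid"
    sim = l2_normalize(phrase_emb) @ l2_normalize(patch_emb).T  # (K, P) cosine
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights.reshape(-1, h, w)

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(24 * 8, 512))
phrases = rng.normal(size=(3, 512))  # e.g. three part-level phrases
maps = explanation_maps(patches, phrases, (24, 8))
print(maps.shape)
```

A lower temperature concentrates each map on fewer patches, which is one simple way to picture "spatially concentrated" activations; the paper's actual constraint on self-attention is not specified here.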

📝 Abstract

Text-to-image person re-identification (TI-ReID) uses a natural-language description to retrieve the top-matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary patch-phrase interaction module (PPIM) with lightweight supervision is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks such as CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy. Our code is included in the supplementary materials and will be made public.
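
Counterfactual region masking, as described above, can be sketched on toy data. The sketch below is an assumption-laden illustration, not the paper's protocol: it stands in for the retrieval model with a mean-pooled patch embedding compared to a text embedding by cosine similarity, removes the `k` patches an explanation ranks highest, and reports the score drop. A faithful explanation should cause a larger drop than masking the same number of random patches. All names (`masking_drop`, `masked_score`) and the synthetic data are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def masked_score(patch_emb, text_emb, keep):
    """Image-text score using only the patches where keep is True (mean pooling)."""
    return cosine(patch_emb[keep].mean(axis=0), text_emb)

def masking_drop(patch_emb, text_emb, masked_idx):
    """Score drop caused by removing the patches listed in masked_idx."""
    n = patch_emb.shape[0]
    keep = np.ones(n, dtype=bool)
    full = masked_score(patch_emb, text_emb, keep)
    keep[masked_idx] = False
    return full - masked_score(patch_emb, text_emb, keep)

# Synthetic image: 40 patches, of which the first 10 carry the text's concept.
rng = np.random.default_rng(0)
dim, n_patches, k = 16, 40, 10
text = np.zeros(dim)
text[0] = 1.0
patches = 0.05 * rng.normal(size=(n_patches, dim))
patches[:10, 0] += 1.0

# Relevance per patch, standing in for an explanation map.
relevance = patches @ text
top_k = np.argsort(relevance)[-k:]          # most relevant patches
random_k = rng.choice(n_patches, size=k, replace=False)

drop_top = masking_drop(patches, text, top_k)
drop_rand = masking_drop(patches, text, random_k)
print(f"drop (top-{k} masked): {drop_top:.3f}, drop (random): {drop_rand:.3f}")
```

In a real evaluation the score would be replaced by a retrieval metric (for example, rank-1 accuracy over a gallery) computed before and after masking; the mean-pooled cosine score here is only a stand-in to keep the example self-contained.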

Problem

Research questions and friction points this paper is trying to address.

person re-identification

interpretability

text-to-image retrieval

phrase-region grounding

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable person re-identification

phrase-region grounding

part-wise matching