DiCo: Disentangled concept representation for text-to-image person re-identification

📅 2026-01-01
🏛️ Neurocomputing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant semantic gap between visual and textual modalities and the challenge of fine-grained attribute alignment in text-to-image person re-identification. To this end, we propose DiCo, a novel framework that introduces, for the first time, a hierarchical structure of slots and concept blocks. Shared slots enable part-level cross-modal alignment, while each slot is decomposed into complementary concept blocks representing fine-grained attributes such as color, texture, and shape, thereby achieving disentangled representations. Integrating slot-based attention mechanisms with cross-modal contrastive learning, DiCo achieves state-of-the-art performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks, while also enabling high interpretability and fine-grained retrieval capabilities.
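The summary describes shared part-level slots that compete for image or text features and are then decomposed into fine-grained concept blocks. As a rough illustration only (the paper's actual architecture, dimensions, and update rules are not given here; the iterative weighted-mean update and the reshape-based block split below are assumptions in the spirit of standard slot attention):

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iter=3):
    # features: (N, D) token/patch embeddings from one modality
    # slots:    (S, D) shared part-level slots
    D = slots.shape[1]
    for _ in range(n_iter):
        # attention normalized over slots, so slots compete for features
        attn = softmax(features @ slots.T / np.sqrt(D), axis=1)   # (N, S)
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)    # per-slot weighted mean
        slots = attn.T @ features                                  # (S, D) updated slots
    return slots

def to_concept_blocks(slots, n_blocks):
    # split each slot into complementary concept blocks
    # (e.g., color / texture / shape in the paper's terminology)
    S, D = slots.shape
    assert D % n_blocks == 0
    return slots.reshape(S, n_blocks, D // n_blocks)               # (S, K, D/K)
```

In a cross-modal setting, the same slot initializations would be applied to both image patches and text tokens, and a contrastive loss would pull corresponding slot (or block) representations together across modalities.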

Problem

Research questions and friction points this paper is trying to address.

Text-to-image person re-identification
modality gap
fine-grained correspondence
disentangled representation
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation
slot-based modeling
cross-modal alignment
text-to-image person re-identification
concept decomposition
Giyeol Kim
Department of Imaging Science, Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University, Seoul, 06974, South Korea
Chanho Eom
Assistant Professor @Chung-Ang University
Computer Vision · Machine Learning · Artificial Intelligence