🤖 AI Summary
This work targets two challenges in text-to-image person re-identification: the semantic gap between visual and textual modalities and fine-grained attribute alignment. To this end, we propose DiCo, a novel framework that introduces, for the first time, a hierarchical structure of slots and concept blocks. Shared slots enable part-level cross-modal alignment, while each slot is further decomposed into complementary concept blocks that capture fine-grained attributes such as color, texture, and shape, yielding disentangled representations. By integrating slot-based attention with cross-modal contrastive learning, DiCo achieves state-of-the-art performance on the CUHK-PEDES, ICFG-PEDES, and RSTPReid benchmarks while also providing strong interpretability and fine-grained retrieval.
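To make the slot/concept-block hierarchy concrete, here is a minimal sketch of how a cross-modal score could be computed over such a structure. All dimensions, function names, and the block-wise cosine scoring are illustrative assumptions, not the paper's actual method: each modality is assumed to yield the same number of shared slots, each slot vector is split into fixed-size concept blocks, and the retrieval score averages cosine similarity over aligned (slot, block) pairs.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): 4 shared slots,
# each decomposed into 3 concept blocks of 8 dims each.
NUM_SLOTS, NUM_BLOCKS, BLOCK_DIM = 4, 3, 8
SLOT_DIM = NUM_BLOCKS * BLOCK_DIM

def to_blocks(slots: np.ndarray) -> np.ndarray:
    """Decompose each slot vector into its complementary concept blocks."""
    return slots.reshape(NUM_SLOTS, NUM_BLOCKS, BLOCK_DIM)

def block_similarity(img_slots: np.ndarray, txt_slots: np.ndarray) -> float:
    """Cross-modal score: mean cosine similarity over aligned
    (slot, concept-block) pairs from the two modalities."""
    a, b = to_blocks(img_slots), to_blocks(txt_slots)
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return float((a * b).sum(axis=-1).mean())

rng = np.random.default_rng(0)
img = rng.normal(size=(NUM_SLOTS, SLOT_DIM))
sim_self = block_similarity(img, img)  # identical slot sets score ~1.0
sim_rand = block_similarity(img, rng.normal(size=(NUM_SLOTS, SLOT_DIM)))
print(sim_self, sim_rand)
```

Because the score is a sum of per-block terms, it can be inspected block by block, which is one way a structure like this supports interpretable, attribute-level retrieval.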