🤖 AI Summary
This paper introduces Omni Multi-modal Person Re-identification (OM-ReID), a new task enabling cross-modal person retrieval under arbitrary query combinations of RGB, infrared, sketch, color-pencil, and text modalities. To address the limitations of existing methods, namely restricted modality coverage and insufficient unified modeling capability, the authors: (1) construct ORBench, the first high-quality five-modal benchmark; (2) propose ReID5o, a single-model unified encoder with a dynamic multi-expert routing architecture that achieves modality-agnostic feature alignment and collaborative fusion; and (3) incorporate cross-modal contrastive learning to enhance semantic consistency across heterogeneous modalities. Extensive experiments on ORBench show that the proposed method significantly outperforms state-of-the-art approaches, achieving robust and efficient retrieval across all possible modality combinations for the first time. The dataset and code will be publicly released.
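The summary's "dynamic multi-expert routing" can be pictured as a learned gate that mixes several expert projections per input token. The sketch below is a minimal, hypothetical numpy illustration of this general mixture-of-experts routing pattern, not the paper's actual ReID5o implementation; all parameter names and sizes (`DIM`, `NUM_EXPERTS`, `gate_W`, `expert_W`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, NUM_EXPERTS = 8, 4  # hypothetical feature and expert counts

# Hypothetical learned parameters (random here, for illustration only).
expert_W = rng.standard_normal((NUM_EXPERTS, DIM, DIM)) * 0.1  # one projection per expert
gate_W = rng.standard_normal((DIM, NUM_EXPERTS)) * 0.1         # routing gate

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route(features):
    """Dynamically fuse expert outputs per input feature.

    features: (batch, DIM) modality-agnostic embeddings.
    Returns a (batch, DIM) convex combination of each expert's
    projection, weighted by a per-sample softmax gate.
    """
    gates = softmax(features @ gate_W)                          # (batch, E)
    expert_out = np.einsum('bd,edk->bek', features, expert_W)   # (batch, E, DIM)
    return np.einsum('be,bek->bk', gates, expert_out)

x = rng.standard_normal((2, DIM))   # two input tokens
fused = route(x)                    # (2, DIM) routed features
```

In a real system the gate and experts would be trained end-to-end, and the gate input could include a modality indicator so that routing adapts to whichever query modalities are present.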
📝 Abstract
In real-world scenarios, person re-identification (ReID) aims to identify a person of interest via a descriptive query, regardless of whether that query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to a limited set of modalities and therefore fail to meet this requirement. We thus investigate a new and challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address the scarcity of suitable data, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. ORBench also offers substantial diversity, e.g., in painting perspectives and textual information, making it an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID that enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model via a proposed unified encoding and multi-expert routing mechanism. Extensive experiments verify the advantages and practicality of ORBench. A wide range of possible models have been evaluated and compared on it, with our proposed ReID5o model achieving the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.
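Cross-modal alignment of the kind the abstract describes is commonly trained with a symmetric contrastive (InfoNCE-style) objective over paired embeddings from two modalities. The sketch below shows that generic objective in numpy; it is an assumption-laden illustration of the standard technique, not the paper's actual loss, and the function name and `temperature` value are hypothetical.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_modal_infonce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired embeddings from two
    modalities (e.g., RGB image vs. text description).

    a, b: (batch, dim) arrays; row i of each encodes the same identity.
    Pulls matched pairs together and pushes mismatched pairs apart,
    averaged over both retrieval directions (a->b and b->a).
    """
    a, b = l2norm(a), l2norm(b)
    logits = a @ b.T / temperature          # (batch, batch) cosine similarities
    labels = np.arange(len(a))              # positives lie on the diagonal

    def ce(l):
        # cross-entropy of each row's softmax against its diagonal entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))
```

With five modalities, such a loss can be applied over every available modality pair for an identity, which is one simple way to encourage the semantic consistency across heterogeneous modalities that the abstract mentions.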