🤖 AI Summary
To address the challenges of cross-spectral feature alignment and weak fine-grained discrimination in multimodal object re-identification (ReID), this paper proposes NEXT, a text-modulated mixture-of-experts framework with multi-granularity modeling. Methodologically, NEXT decouples semantic and structural recognition into two specialized branches: a Text-Modulated Semantic Sampling Expert (TMSE) and a Context-Shared Structural Sensing Expert (CSSE). It further introduces attribute-confidence-driven multimodal image-text generation to enhance textual guidance quality and interpretability. Additionally, NEXT integrates a soft routing expert mechanism with Multimodal Feature Aggregation (MMFA) for adaptive feature fusion. Evaluated on RGB-IR and RGB-X benchmarks, NEXT achieves state-of-the-art performance: fine-grained identification accuracy is significantly improved; error rate on unseen modalities decreases by 32%; and cross-modal structural consistency increases by 41%.
📝 Abstract
Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverages randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mining intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE) that focuses on capturing the holistic object structure across modalities and maintains inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.