Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification

📅 2026-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization capability of models in domain-generalized person re-identification, particularly the inability of existing vision-language models to capture subtle identity-discriminative cues due to their reliance on global features alone. To overcome this, we propose a CLIP-based multi-granularity vision-language alignment framework that leverages fine-grained textual prompts describing distinct body parts. By integrating an adaptive masked multi-head self-attention mechanism, our method aligns localized visual regions with their corresponding semantic descriptions. Furthermore, pseudo-labels for body parts generated by a multimodal large language model (MLLM) are employed to supervise the training process. Extensive experiments under both single-source and multi-source domain generalization settings demonstrate that our approach significantly outperforms state-of-the-art methods, validating the effectiveness and novelty of the proposed multi-granularity alignment strategy.
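The adaptive masked multi-head self-attention described above can be illustrated with a minimal sketch: a mask over the token sequence blocks attention to tokens outside a given body-part region, so the output pools features only from that part. This is a hypothetical, simplified rendering (identity projections, numpy instead of a deep-learning framework, no learned mask adaptation), not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_mhsa(tokens, part_mask, num_heads=4):
    """Single-layer masked multi-head self-attention sketch.

    tokens:    (n, d) patch-token features.
    part_mask: (n,) binary mask; part_mask[j] == 0 blocks attention to
               token j, restricting each output to the selected part region.
    Projections are identity here for brevity (hypothetical simplification).
    """
    n, d = tokens.shape
    dh = d // num_heads
    out = np.empty_like(tokens)
    for h in range(num_heads):
        q = k = v = tokens[:, h * dh:(h + 1) * dh]  # identity W_q/W_k/W_v
        scores = q @ k.T / np.sqrt(dh)
        # Masked positions get a large negative score -> ~zero attention weight.
        scores = np.where(part_mask[None, :] > 0, scores, -1e9)
        out[:, h * dh:(h + 1) * dh] = softmax(scores) @ v
    return out
```

For example, with a mask that keeps only the first token, every output row collapses to that token's features, i.e. the module returns a part-local representation regardless of query position.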

📝 Abstract
Domain Generalized person Re-identification (DG Re-ID) is a challenging task in which models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, there remains room for further improvement. Recently, Vision-Language Models (VLMs) have shown outstanding generalization capabilities across various visual applications. However, directly adapting a VLM to Re-ID yields limited generalization gains, because the VLM produces only global features that are insensitive to identity nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework. Specifically, multi-grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract part-specific features. To train the proposed module, an MLLM-based visual grounding expert automatically generates pseudo-labels of body parts for supervision. Extensive experiments on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
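The CLIP-style alignment between part-level visual features and part-prompt text embeddings can be sketched as a cosine-similarity matrix with matched pairs on the diagonal, trained with a symmetric contrastive loss. All shapes, the temperature value, and the loss form below are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def part_alignment_logits(part_feats, prompt_feats, temperature=0.07):
    """CLIP-style logits between K part-level visual features (K, d) and
    K part-prompt text embeddings (K, d); pair k matches on the diagonal.
    Temperature 0.07 follows common CLIP practice (assumed, not from the paper)."""
    v = l2norm(part_feats)
    t = l2norm(prompt_feats)
    return v @ t.T / temperature

def alignment_loss(logits):
    """Symmetric InfoNCE-style loss pulling each visual part toward its
    own textual description (diagonal entries of the logit matrix)."""
    def ce(lg):
        p = np.exp(lg - lg.max(axis=1, keepdims=True))
        p = p / p.sum(axis=1, keepdims=True)
        return -np.mean(np.log(np.diag(p)))
    return 0.5 * (ce(logits) + ce(logits.T))
```

Intuitively, correctly matched part/prompt pairs drive the loss down, while mismatched pairings (e.g. a "legs" feature against an "upper body" prompt) drive it up, which is what supervises the fine-grained alignment.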
Problem

Research questions and friction points this paper is trying to address.

Domain Generalized Person Re-Identification
Vision-Language Models
Multi-Grained Alignment
Generalization
Fine-Grained Features
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-grained alignment
vision-language model
domain generalization
person re-identification
visual grounding