AI Summary
This work addresses the insufficient exploitation of local details and global semantic information in person re-identification under occlusion and pose variations. To this end, it proposes a dual-regularized bidirectional Transformer architecture that systematically integrates the vision foundation model DINO with the vision-language model CLIP, a first in the field. The method employs a bidirectional interaction mechanism to jointly extract local texture and global semantic features, complemented by a dual-regularization strategy that dynamically balances their contributions. Evaluated on five mainstream ReID benchmarks, the approach achieves competitive performance and markedly improves the fusion of local and global representations, demonstrating the effectiveness and novelty of the proposed architecture.
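As a loose illustration of what such a bidirectional interaction mechanism could look like, the minimal PyTorch sketch below cross-attends a DINO-style local token branch and a CLIP-style global token branch in both directions. All class names, dimensions, and design choices here are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a bidirectional interaction block between a
# DINO-style local branch and a CLIP-style global branch. Names and
# design choices are assumptions, not DRFormer itself.
import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Local tokens attend to global tokens, and vice versa.
        self.local_to_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_to_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_l = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor):
        # local_tokens:  (B, N_l, dim) patch features from the DINO branch
        # global_tokens: (B, N_g, dim) semantic features from the CLIP branch
        l2g, _ = self.local_to_global(local_tokens, global_tokens, global_tokens)
        g2l, _ = self.global_to_local(global_tokens, local_tokens, local_tokens)
        # Residual fusion so each branch is enriched by the other.
        return self.norm_l(local_tokens + l2g), self.norm_g(global_tokens + g2l)

# Usage: fuse 196 DINO patch tokens with 197 CLIP tokens for a batch of 4 images.
block = BidirectionalInteraction(dim=768)
local_out, global_out = block(torch.randn(4, 196, 768), torch.randn(4, 197, 768))
```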
Abstract
Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges such as occlusion and pose variations. Vision foundation models (\textit{e.g.}, DINO) excel at mining local textures, while vision-language models (\textit{e.g.}, CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework that synergizes their strengths through a \textbf{D}ual-\textbf{R}egularized Bidirectional \textbf{Transformer} (\textbf{DRFormer}). The dual-regularization mechanism ensures diverse feature extraction and better balances the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving performance competitive with state-of-the-art methods.
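One plausible way a dual-regularized fusion could be instantiated is sketched below: a learned gate dynamically balances the two branches, while a diversity term discourages them from collapsing onto the same representation and a balance term keeps the gate from saturating toward one branch. This is purely illustrative under those assumptions; the regularizers actually used in DRFormer may be defined differently.

```python
# Hypothetical sketch of a dual-regularized fusion. The gate, the diversity
# regularizer, and the balance regularizer are assumed forms for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor):
        # f_local / f_global: (B, dim) pooled features from each branch.
        g = self.gate(torch.cat([f_local, f_global], dim=-1))  # per-dim weight in [0, 1]
        fused = g * f_local + (1.0 - g) * f_global
        # Diversity regularizer: penalize high cosine similarity between branches.
        r_div = F.cosine_similarity(f_local, f_global, dim=-1).mean()
        # Balance regularizer: keep contributions from drifting to a single branch.
        r_bal = (g.mean() - 0.5).pow(2)
        return fused, r_div, r_bal

fusion = GatedFusion()
fused, r_div, r_bal = fusion(torch.randn(4, 768), torch.randn(4, 768))
loss_reg = r_div + r_bal  # would be added to the ReID loss with weighting coefficients
```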