DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification

📅 2026-02-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the insufficient exploitation of local details and global semantic information in person re-identification (ReID) under occlusion and pose variations. It proposes DRFormer, a dual-regularized bidirectional Transformer that integrates the vision foundation model DINO with the vision-language model CLIP, reportedly for the first time in the field. A bidirectional interaction mechanism jointly extracts local texture and global semantic features, and a dual regularization strategy balances the contributions of the two models. Evaluated on five mainstream ReID benchmarks, the approach achieves performance competitive with state-of-the-art methods, indicating that it fuses local and global representations effectively.

📝 Abstract
Both fine-grained discriminative details and global semantic features can help address person re-identification challenges such as occlusion and pose variations. Vision foundation models (e.g., DINO) excel at mining local textures, and vision-language models (e.g., CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework that synergizes their strengths through a Dual-Regularized Bidirectional Transformer (DRFormer). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.
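The abstract does not give the equations behind the bidirectional interaction or the dual regularization, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the interaction is a residual cross-attention applied in both directions between a "local" token stream (standing in for DINO features) and a "global" token stream (standing in for CLIP features), and that the regularizer combines a diversity term (squared cosine similarity between the two pooled features) with a balance term (normalized difference of their norms). All function names and both penalty terms are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Residual cross-attention: each query token adds an
    attention-weighted average of the context tokens to itself."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        mixed = [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def bidirectional_fuse(local_tokens, global_tokens):
    """Assumed interaction: local attends to global and global attends to local."""
    new_local = cross_attend(local_tokens, global_tokens)
    new_global = cross_attend(global_tokens, local_tokens)
    return new_local, new_global


def mean_pool(tokens):
    """Average a list of token vectors into one feature vector."""
    count = len(tokens)
    return [sum(t[d] for t in tokens) / count for d in range(len(tokens[0]))]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def dual_regularizer(local_feat, global_feat, eps=1e-12):
    """Hypothetical two-term penalty.
    diversity: squared cosine similarity, small when the features differ.
    balance:   squared normalized norm gap, small when neither stream dominates."""
    n_local, n_global = l2_norm(local_feat), l2_norm(global_feat)
    dot = sum(a * b for a, b in zip(local_feat, global_feat))
    diversity = (dot / (n_local * n_global + eps)) ** 2
    balance = ((n_local - n_global) / (n_local + n_global + eps)) ** 2
    return diversity + balance


if __name__ == "__main__":
    # Toy inputs: 3 local tokens and 2 global tokens, each of dimension 4.
    local_tokens = [[0.1, 0.2, 0.0, 0.5], [0.3, 0.1, 0.4, 0.0], [0.0, 0.6, 0.2, 0.1]]
    global_tokens = [[0.5, 0.0, 0.1, 0.2], [0.2, 0.3, 0.3, 0.1]]
    fused_local, fused_global = bidirectional_fuse(local_tokens, global_tokens)
    penalty = dual_regularizer(mean_pool(fused_local), mean_pool(fused_global))
    print(len(fused_local), len(fused_global), penalty)
```

In a real system the two streams would come from pretrained encoders, the attention would use learned projections, and the penalty would be added to the ReID training loss with tunable weights. None of those details are stated in the abstract.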
Problem

Research questions and friction points this paper is trying to address.

person re-identification
vision foundation models
vision-language models
feature integration
occlusion and pose variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Regularized Transformer
Person Re-identification
Vision Foundation Models
Vision-Language Models
Feature Fusion
Authors
Ying Shu
Institute of Network Science and Intelligent Systems, Beijing Jiaotong University, Beijing, China
Pujian Zhan
Institute of Network Science and Intelligent Systems, Beijing Jiaotong University, Beijing, China
Huiqi Yang
Institute of Network Science and Intelligent Systems, Beijing Jiaotong University, Beijing, China
Hehe Fan
Zhejiang University
Deep learning, Computer vision, Multimedia, AI for science
Youfang Lin
Institute of Network Science and Intelligent Systems, Beijing Jiaotong University, Beijing, China
Kai Lv
Beijing Jiaotong University
Computer Vision, Deep Learning