AI Summary
This work addresses the insufficient exploitation of local details and global semantic information in person re-identification under occlusion and pose variations. To this end, it proposes a dual-regularized bidirectional Transformer architecture that systematically integrates the vision foundation model DINO with the vision-language model CLIP, a first in the field. The method employs a bidirectional interaction mechanism to jointly extract local texture and global semantic features, complemented by a dual-regularization strategy that dynamically balances their contributions. Evaluated on five mainstream ReID benchmarks, the approach achieves competitive performance and markedly improves the fusion of local and global representations, demonstrating the effectiveness and novelty of the proposed architecture.
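As a loose illustration of what such a bidirectional interaction mechanism could look like, the minimal PyTorch sketch below cross-attends a DINO-style local token branch and a CLIP-style global token branch in both directions. All class names, dimensions, and design choices here are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a bidirectional interaction block between a
# DINO-style local branch and a CLIP-style global branch. Names and
# design choices are assumptions, not DRFormer itself.
import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Local tokens attend to global tokens, and vice versa.
        self.local_to_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_to_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_l = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor):
        # local_tokens:  (B, N_l, dim) patch features from the DINO branch
        # global_tokens: (B, N_g, dim) semantic features from the CLIP branch
        l2g, _ = self.local_to_global(local_tokens, global_tokens, global_tokens)
        g2l, _ = self.global_to_local(global_tokens, local_tokens, local_tokens)
        # Residual fusion so each branch is enriched by the other.
        return self.norm_l(local_tokens + l2g), self.norm_g(global_tokens + g2l)

# Usage: fuse 196 DINO patch tokens with 197 CLIP tokens for a batch of 4 images.
block = BidirectionalInteraction(dim=768)
local_out, global_out = block(torch.randn(4, 196, 768), torch.randn(4, 197, 768))
```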
Abstract
Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges such as occlusion and pose variations. Vision foundation models (\textit{e.g.}, DINO) excel at mining local textures, while vision-language models (\textit{e.g.}, CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework that synergizes their strengths through a \textbf{D}ual-\textbf{R}egularized Bidirectional \textbf{Transformer} (\textbf{DRFormer}). The dual-regularization mechanism ensures diverse feature extraction and better balances the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving performance competitive with state-of-the-art methods.
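One plausible way a dual-regularized fusion could be instantiated is sketched below: a learned gate dynamically balances the two branches, while a diversity term discourages them from collapsing onto the same representation and a balance term keeps the gate from saturating toward one branch. This is purely illustrative under those assumptions; the regularizers actually used in DRFormer may be defined differently.

```python
# Hypothetical sketch of a dual-regularized fusion. The gate, the diversity
# regularizer, and the balance regularizer are assumed forms for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor):
        # f_local / f_global: (B, dim) pooled features from each branch.
        g = self.gate(torch.cat([f_local, f_global], dim=-1))  # per-dim weight in [0, 1]
        fused = g * f_local + (1.0 - g) * f_global
        # Diversity regularizer: penalize high cosine similarity between branches.
        r_div = F.cosine_similarity(f_local, f_global, dim=-1).mean()
        # Balance regularizer: keep contributions from drifting to a single branch.
        r_bal = (g.mean() - 0.5).pow(2)
        return fused, r_div, r_bal

fusion = GatedFusion()
fused, r_div, r_bal = fusion(torch.randn(4, 768), torch.randn(4, 768))
loss_reg = r_div + r_bal  # would be added to the ReID loss with weighting coefficients
```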