Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance degradation in person re-identification (ReID) under complex surveillance scenarios—caused by occlusion, viewpoint distortion, and low-quality imagery—this paper proposes Sh-ViT, a lightweight Vision Transformer (ViT)-based model. Methodologically, it introduces (1) a spatial shuffling module that disrupts ViT’s reliance on fixed patch positions and enhances robustness against partial occlusion; (2) an adaptive data augmentation strategy tailored for surveillance, incorporating geometric transformations, random erasing, blurring, and color jittering; and (3) DeiT-based knowledge distillation for efficient model compression. To alleviate the scarcity of real-world occluded training data, the authors construct MyTT—a fine-grained, occlusion-specific ReID benchmark. Experiments demonstrate that Sh-ViT achieves 83.2% Rank-1 accuracy and 80.1% mAP on MyTT, and 94.6% Rank-1 accuracy and 87.5% mAP on Market1501—substantially outperforming state-of-the-art CNN- and ViT-based methods.
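The spatial shuffling idea in (1) can be illustrated with a short sketch: randomly permute the patch tokens of a ViT sequence while leaving the [CLS] token in place, so the classifier cannot rely on a fixed spatial layout. This is a hypothetical, simplified version for intuition only; the paper applies its Shuffle module inside the final Transformer layer, and the exact mechanism may differ.

```python
import numpy as np

def shuffle_tokens(tokens, rng):
    """Permute patch tokens while keeping the [CLS] token fixed.

    tokens: array of shape (num_tokens, dim), where row 0 is assumed
    to be the [CLS] token and the rest are patch embeddings.
    Shuffling breaks fixed spatial correlations between patches,
    which is the intuition behind occlusion robustness here.
    """
    cls_token, patches = tokens[:1], tokens[1:]
    perm = rng.permutation(len(patches))
    return np.concatenate([cls_token, patches[perm]], axis=0)

# ViT-Base on 224x224 images with 16x16 patches: 1 [CLS] + 196 patches, dim 768
rng = np.random.default_rng(0)
tokens = rng.normal(size=(197, 768))
shuffled = shuffle_tokens(tokens, rng)
```

Note that only the order of patch tokens changes; the set of token values is preserved, so downstream attention still sees the same content in a scrambled spatial arrangement.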

📝 Abstract
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
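One piece of the scenario-adapted augmentation, random erasing, can be sketched as blanking out a rectangular region of the image to mimic equipment occlusion. The function below is a simplified, hypothetical variant (fixed erase area, random-noise fill); the paper's actual pipeline combines this with geometric transforms, blur, and color adjustment.

```python
import numpy as np

def random_erase(img, rng, area_frac=0.1):
    """Erase a random rectangle covering ~area_frac of the image.

    img: (H, W, C) float array with values in [0, 1].
    The erased region is filled with uniform noise, simulating an
    occluding object of unknown appearance.
    """
    h, w = img.shape[:2]
    eh = max(1, int(h * np.sqrt(area_frac)))
    ew = max(1, int(w * np.sqrt(area_frac)))
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[top:top + eh, left:left + ew] = rng.uniform(
        0.0, 1.0, size=(eh, ew, img.shape[2]))
    return out

# Example: erase ~10% of a 64x32 pedestrian crop
img = np.zeros((64, 32, 3))
erased = random_erase(img, np.random.default_rng(1))
```

Applying this during training exposes the model to partial occlusion it will encounter at inference, without needing extra occluded training data.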
Problem

Research questions and friction points this paper is trying to address.

Improving occluded person re-identification in complex surveillance scenes
Addressing challenges of occlusion and blur in person re-identification
Enhancing robustness to occlusion without external complex modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shuffle module breaks spatial correlations for occlusion robustness
Scenario-adapted augmentation simulates real surveillance conditions
DeiT-based knowledge distillation improves learning with limited labels
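The distillation bullet above follows DeiT's hard-label scheme, which can be sketched as averaging two cross-entropy terms: one against the ground-truth labels and one against the teacher's argmax predictions used as hard pseudo-labels. This is an illustrative sketch of the general DeiT loss, not the paper's exact training recipe.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_distill_loss(student_logits, teacher_logits, labels):
    """DeiT-style hard distillation: average the cross-entropy against
    the true labels and against the teacher's hard predictions."""
    probs = softmax(student_logits)
    n = len(labels)
    ce_true = -np.log(probs[np.arange(n), labels]).mean()
    teacher_hard = teacher_logits.argmax(axis=-1)
    ce_teacher = -np.log(probs[np.arange(n), teacher_hard]).mean()
    return 0.5 * (ce_true + ce_teacher)

# A confident student that agrees with both labels and teacher -> small loss
student = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
labels = np.array([0, 1])
loss = hard_distill_loss(student, student, labels)
```

The teacher term supplies extra supervision per image, which is why distillation helps when labeled data is limited.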
Bo Li
China Tower Corporation Limited, Beijing, 100089, China
Duyuan Zheng
China University of Petroleum-Beijing at Karamay, Karamay, 834000, China
Xinyang Liu
China University of Petroleum-Beijing at Karamay, Karamay, 834000, China
Qingwen Li
Suzhou Institute of Nano-Tech and Nano-Bionics
Hong Li
China Tower Corporation Limited, Beijing, 100089, China
Hongyan Cui
Beijing University of Posts and Telecommunications, Beijing, 100876, China
Ge Gao
China University of Petroleum-Beijing at Karamay, Karamay, 834000, China
Chen Liu
China University of Petroleum-Beijing at Karamay, Karamay, 834000, China