🤖 AI Summary
To address performance degradation in person re-identification (ReID) under complex surveillance conditions, caused by occlusion, viewpoint distortion, and low-quality imagery, this paper proposes Sh-ViT, a lightweight Vision Transformer (ViT)-based model. Methodologically, it introduces (1) a spatial shuffling module in the final Transformer layer that breaks spatial correlations among patch tokens and improves robustness to partial occlusion; (2) a scenario-adapted data augmentation strategy for surveillance imagery, combining geometric transformations, random erasing, blurring, and color jittering; and (3) DeiT-style knowledge distillation to improve learning with limited labels. To alleviate the scarcity of real-world occluded training data, the authors construct MyTT, a fine-grained, occlusion-focused ReID benchmark drawn from base station inspections. Experiments demonstrate that Sh-ViT achieves 83.2% Rank-1 accuracy and 80.1% mAP on MyTT, and 94.6% Rank-1 accuracy and 87.5% mAP on Market1501, substantially outperforming state-of-the-art CNN- and ViT-based methods.
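The shuffling idea can be illustrated with a minimal NumPy sketch: randomly permute the patch tokens entering the final Transformer layer while keeping the class token in place. The function name, the token layout, and the choice of a full random permutation are illustrative assumptions; the abstract does not specify the exact shuffle rule the paper uses.

```python
import numpy as np

def shuffle_patch_tokens(tokens: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly permute patch tokens, keeping the [CLS] token fixed.

    tokens: (N + 1, D) array; row 0 is the [CLS] token, rows 1..N are
    patch tokens. A full random permutation is an assumption here; the
    paper may use a structured (e.g. grouped) shuffle instead.
    """
    cls_token, patches = tokens[:1], tokens[1:]
    perm = rng.permutation(patches.shape[0])
    return np.concatenate([cls_token, patches[perm]], axis=0)

# Toy example: 4 patch tokens of dimension 2 plus one [CLS] token.
rng = np.random.default_rng(0)
tokens = np.arange(10, dtype=float).reshape(5, 2)
shuffled = shuffle_patch_tokens(tokens, rng)
```

Because the permutation only reorders patch tokens, the set of token features is preserved; only their spatial arrangement is destroyed, which is what forces the model to stop relying on fixed spatial correlations.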
📝 Abstract
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer that breaks spatial correlations and enhances robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
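The distillation component pairs the usual supervised loss with a term pulling the student toward a teacher's predictions. Below is a minimal NumPy sketch of the soft-label variant; the temperature `T`, weight `alpha`, and function names are illustrative assumptions (not the paper's values), and DeiT itself additionally uses a dedicated distillation token and a hard-label variant that this sketch omits.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    """Weighted sum of cross-entropy on ground-truth labels and a
    temperature-scaled KL divergence to the teacher (soft distillation).
    T and alpha are illustrative hyperparameters."""
    # Supervised term: standard cross-entropy against the hard labels.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(len(labels)), labels]).mean()
    # Distillation term: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep its magnitude comparable across temperatures.
    q_teacher = softmax(teacher_logits / T)
    log_p_T = np.log(softmax(student_logits / T))
    kl = (q_teacher * (np.log(q_teacher) - log_p_T)).sum(axis=-1).mean() * T * T
    return alpha * ce + (1.0 - alpha) * kl

# Toy example: 2 samples, 3 identity classes.
student = np.array([[2.0, 0.5, -1.0], [0.2, 1.5, 0.1]])
teacher = np.array([[2.2, 0.3, -0.8], [0.0, 1.8, 0.2]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

When the student matches the teacher exactly, the KL term vanishes and only the supervised cross-entropy remains, which is why the teacher signal helps most early in training when labels are scarce.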