🤖 AI Summary
To address performance degradation in person re-identification (ReID) under complex surveillance conditions, caused by occlusion, viewpoint distortion, and low-quality imagery, this paper proposes Sh-ViT, a lightweight Vision Transformer (ViT)-based model. Methodologically, it introduces (1) a spatial shuffling module in the final Transformer layer that breaks spatial correlations among patch tokens and improves robustness to partial occlusion; (2) a scenario-adapted data augmentation strategy for surveillance imagery, combining geometric transformations, random erasing, blurring, and color jittering; and (3) DeiT-style knowledge distillation to improve learning with limited labels. To alleviate the scarcity of real-world occluded training data, the authors construct MyTT, a fine-grained, occlusion-focused ReID benchmark drawn from base station inspections. Experiments demonstrate that Sh-ViT achieves 83.2% Rank-1 accuracy and 80.1% mAP on MyTT, and 94.6% Rank-1 accuracy and 87.5% mAP on Market1501, substantially outperforming state-of-the-art CNN- and ViT-based methods.
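The shuffling idea can be illustrated with a minimal NumPy sketch: randomly permute the patch tokens entering the final Transformer layer while keeping the class token in place. The function name, the token layout, and the choice of a full random permutation are illustrative assumptions; the abstract does not specify the exact shuffle rule the paper uses.

```python
import numpy as np

def shuffle_patch_tokens(tokens: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly permute patch tokens, keeping the [CLS] token fixed.

    tokens: (N + 1, D) array; row 0 is the [CLS] token, rows 1..N are
    patch tokens. A full random permutation is an assumption here; the
    paper may use a structured (e.g. grouped) shuffle instead.
    """
    cls_token, patches = tokens[:1], tokens[1:]
    perm = rng.permutation(patches.shape[0])
    return np.concatenate([cls_token, patches[perm]], axis=0)

# Toy example: 4 patch tokens of dimension 2 plus one [CLS] token.
rng = np.random.default_rng(0)
tokens = np.arange(10, dtype=float).reshape(5, 2)
shuffled = shuffle_patch_tokens(tokens, rng)
```

Because the permutation only reorders patch tokens, the set of token features is preserved; only their spatial arrangement is destroyed, which is what forces the model to stop relying on fixed spatial correlations.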
📝 Abstract
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer that breaks spatial correlations and enhances robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
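The distillation component pairs the usual supervised loss with a term pulling the student toward a teacher's predictions. Below is a minimal NumPy sketch of the soft-label variant; the temperature `T`, weight `alpha`, and function names are illustrative assumptions (not the paper's values), and DeiT itself additionally uses a dedicated distillation token and a hard-label variant that this sketch omits.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    """Weighted sum of cross-entropy on ground-truth labels and a
    temperature-scaled KL divergence to the teacher (soft distillation).
    T and alpha are illustrative hyperparameters."""
    # Supervised term: standard cross-entropy against the hard labels.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(len(labels)), labels]).mean()
    # Distillation term: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep its magnitude comparable across temperatures.
    q_teacher = softmax(teacher_logits / T)
    log_p_T = np.log(softmax(student_logits / T))
    kl = (q_teacher * (np.log(q_teacher) - log_p_T)).sum(axis=-1).mean() * T * T
    return alpha * ce + (1.0 - alpha) * kl

# Toy example: 2 samples, 3 identity classes.
student = np.array([[2.0, 0.5, -1.0], [0.2, 1.5, 0.1]])
teacher = np.array([[2.2, 0.3, -0.8], [0.0, 1.8, 0.2]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

When the student matches the teacher exactly, the KL term vanishes and only the supervised cross-entropy remains, which is why the teacher signal helps most early in training when labels are scarce.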