🤖 AI Summary
This study addresses the robustness of military vehicle classification under partial occlusion and low signal-to-noise ratio, challenging conditions that severely degrade model performance in real-world deployment. Method: We systematically evaluate vision-language models (e.g., CLIP) for zero-shot and fine-tuned classification in label-scarce settings, using a custom military vehicle dataset. We quantitatively analyze occlusion patterns (fine-grained scattered vs. contiguous large-area), propose backbone fine-tuning coupled with occlusion-aware data augmentation, and introduce the Normalized Area Under the Occlusion-Robustness Curve (NAUC) as a novel metric. Contribution/Results: Fine-grained occlusion proves more detrimental than contiguous occlusion, and our method raises the performance-collapse threshold from 35% to over 60% occlusion. Transformer-based architectures significantly outperform CNNs, and optimized models retain usable classification accuracy even at 60% occlusion, establishing a quantifiable robustness-evaluation framework and a practical enhancement strategy for operational deployment.
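The two occlusion regimes compared above (many small scattered patches vs. one contiguous block at the same coverage fraction) and an NAUC-style score can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the patch size, mask placement, and the exact NAUC normalization (here, trapezoidal area under the accuracy-vs-occlusion curve divided by the occlusion range) are assumptions.

```python
import numpy as np


def contiguous_mask(h, w, frac, rng):
    """One square block covering roughly `frac` of an h x w image."""
    side = min(int(round((frac * h * w) ** 0.5)), h, w)
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    m = np.zeros((h, w), dtype=bool)
    m[y:y + side, x:x + side] = True
    return m


def scattered_mask(h, w, frac, rng, patch=4):
    """Many small patch x patch blocks dropped at random until
    roughly `frac` of the image is covered (fine-grained occlusion)."""
    m = np.zeros((h, w), dtype=bool)
    target = frac * h * w
    covered = 0
    while covered < target:
        y = rng.integers(0, h - patch + 1)
        x = rng.integers(0, w - patch + 1)
        block = m[y:y + patch, x:x + patch]
        covered += patch * patch - int(block.sum())  # count only new pixels
        block[:] = True
    return m


def nauc(occ_fracs, accs):
    """Normalized area under the accuracy-vs-occlusion curve:
    a model that held accuracy a at every occlusion level scores a."""
    occ = np.asarray(occ_fracs, dtype=float)
    acc = np.asarray(accs, dtype=float)
    area = np.sum((acc[1:] + acc[:-1]) * np.diff(occ)) / 2.0  # trapezoid rule
    return float(area / (occ[-1] - occ[0]))
```

Evaluating a model's accuracy at a sweep of occlusion fractions with both mask generators, then comparing the two NAUC values, reproduces the kind of comparison the summary describes.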
📝 Abstract
Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate the robustness of CLIP variants to occlusion using a custom dataset of 18 military vehicle classes and evaluate performance with the Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) transformer-based CLIP models consistently outperform CNN-based ones, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, the performance of linear-probed models drops sharply at around 35% occlusion, and (4) fine-tuning the model's backbone delays this drop to beyond 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration of patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
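The zero-shot mechanism the abstract relies on reduces to cosine similarity in CLIP's shared embedding space: encode one text prompt per class (e.g. "a photo of a {class}"), then assign an image to the class whose prompt embedding is nearest. A minimal sketch, with plain numpy vectors standing in for the outputs of CLIP's actual image and text encoders:

```python
import numpy as np


def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray) -> int:
    """CLIP-style zero-shot prediction.

    image_emb: (d,) embedding of one image.
    text_embs: (num_classes, d) embeddings of one prompt per class.
    Returns the index of the class whose prompt embedding has the
    highest cosine similarity with the image embedding.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```

With a real model the embeddings would come from the pretrained encoders (e.g. `encode_image`/`encode_text` in the open_clip library); linear probing, as contrasted with backbone fine-tuning in the abstract, instead trains a classifier on top of the frozen image embeddings.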