🤖 AI Summary
This study addresses the robustness of military vehicle classification under partial occlusion and low signal-to-noise ratio, challenging conditions that severely degrade model performance in real-world deployment. Method: We systematically evaluate vision-language models (e.g., CLIP) for zero-shot and fine-tuned classification in label-scarce settings, using a custom military vehicle dataset. We quantitatively analyze occlusion patterns (fine-grained scattered vs. contiguous large-area), propose backbone fine-tuning coupled with occlusion-aware data augmentation, and introduce the Normalized Area Under the Occlusion-Robustness Curve (NAUC) as a novel metric. Contribution/Results: Fine-grained occlusion proves more detrimental than contiguous occlusion, and our method raises the performance-collapse threshold from 35% to over 60% occlusion. Transformer-based architectures significantly outperform CNNs, and optimized models retain usable classification accuracy even at 60% occlusion, establishing a quantifiable robustness-evaluation framework and a practical enhancement strategy for operational deployment.
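The two occlusion regimes compared above (many small scattered patches vs. one contiguous block at the same coverage fraction) and an NAUC-style score can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the patch size, mask placement, and the exact NAUC normalization (here, trapezoidal area under the accuracy-vs-occlusion curve divided by the occlusion range) are assumptions.

```python
import numpy as np


def contiguous_mask(h, w, frac, rng):
    """One square block covering roughly `frac` of an h x w image."""
    side = min(int(round((frac * h * w) ** 0.5)), h, w)
    y = rng.integers(0, h - side + 1)
    x = rng.integers(0, w - side + 1)
    m = np.zeros((h, w), dtype=bool)
    m[y:y + side, x:x + side] = True
    return m


def scattered_mask(h, w, frac, rng, patch=4):
    """Many small patch x patch blocks dropped at random until
    roughly `frac` of the image is covered (fine-grained occlusion)."""
    m = np.zeros((h, w), dtype=bool)
    target = frac * h * w
    covered = 0
    while covered < target:
        y = rng.integers(0, h - patch + 1)
        x = rng.integers(0, w - patch + 1)
        block = m[y:y + patch, x:x + patch]
        covered += patch * patch - int(block.sum())  # count only new pixels
        block[:] = True
    return m


def nauc(occ_fracs, accs):
    """Normalized area under the accuracy-vs-occlusion curve:
    a model that held accuracy a at every occlusion level scores a."""
    occ = np.asarray(occ_fracs, dtype=float)
    acc = np.asarray(accs, dtype=float)
    area = np.sum((acc[1:] + acc[:-1]) * np.diff(occ)) / 2.0  # trapezoid rule
    return float(area / (occ[-1] - occ[0]))
```

Evaluating a model's accuracy at a sweep of occlusion fractions with both mask generators, then comparing the two NAUC values, reproduces the kind of comparison the summary describes.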
📝 Abstract
Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate the robustness of CLIP variants to occlusion using a custom dataset of 18 military vehicle classes and evaluate performance with the Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) transformer-based CLIP models consistently outperform CNN-based ones, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, the performance of linear-probed models drops sharply at around 35% occlusion, and (4) fine-tuning the model's backbone delays this drop to beyond 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration of patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
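The zero-shot mechanism the abstract relies on reduces to cosine similarity in CLIP's shared embedding space: encode one text prompt per class (e.g. "a photo of a {class}"), then assign an image to the class whose prompt embedding is nearest. A minimal sketch, with plain numpy vectors standing in for the outputs of CLIP's actual image and text encoders:

```python
import numpy as np


def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray) -> int:
    """CLIP-style zero-shot prediction.

    image_emb: (d,) embedding of one image.
    text_embs: (num_classes, d) embeddings of one prompt per class.
    Returns the index of the class whose prompt embedding has the
    highest cosine similarity with the image embedding.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```

With a real model the embeddings would come from the pretrained encoders (e.g. `encode_image`/`encode_text` in the open_clip library); linear probing, as contrasted with backbone fine-tuning in the abstract, instead trains a classifier on top of the frozen image embeddings.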