Occlusion Robustness of CLIP for Military Vehicle Classification

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the robustness of military vehicle classification under partial occlusion and low signal-to-noise ratio, challenging conditions that severely degrade model performance in real-world deployment. Method: We systematically evaluate vision-language models (e.g., CLIP) for zero-shot and fine-tuned classification in label-scarce settings, using a custom military vehicle dataset. We quantitatively analyze occlusion patterns (fine-grained scattered vs. contiguous large-area), propose backbone fine-tuning coupled with occlusion-aware data augmentation, and introduce the Normalized Area Under the Curve (NAUC), computed across occlusion percentages, as a robustness metric. Contribution/Results: Fine-grained occlusion proves more detrimental than contiguous occlusion; our method raises the performance-collapse threshold from 35% to over 60% occlusion. Transformer-based architectures significantly outperform CNNs. Optimized models retain usable classification accuracy even at 60% occlusion, establishing a quantifiable robustness-evaluation framework and a practical enhancement strategy for operational deployment.
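The exact formulation of NAUC is not given on this page; as a rough illustration, here is a minimal sketch in Python, assuming trapezoidal integration of accuracy over the evaluated occlusion range, normalized by the width of that range. The function name, occlusion grid, and accuracy numbers below are illustrative, not the paper's.

```python
import numpy as np

def nauc(occlusion_levels, accuracies):
    """Normalized Area Under the occlusion-robustness Curve (sketch).

    Integrates accuracy over the evaluated occlusion range with the
    trapezoidal rule, then normalizes by the width of that range, so a
    perfectly robust, perfectly accurate model scores 1.0.
    """
    x = np.asarray(occlusion_levels, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    order = np.argsort(x)                 # occlusion levels in ascending order
    x, y = x[order], y[order]
    area = ((y[1:] + y[:-1]) / 2.0 * np.diff(x)).sum()  # trapezoidal rule
    return area / (x[-1] - x[0])

# Illustrative numbers only: a curve that collapses past ~35% occlusion.
levels = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
accs   = [0.92, 0.90, 0.87, 0.80, 0.45, 0.30, 0.20]
print(f"NAUC = {nauc(levels, accs):.3f}")
```

Normalizing by the range width keeps the score comparable across studies that evaluate different maximum occlusion levels.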

📝 Abstract
Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models drops sharply at around 35% occlusion, (4) fine-tuning the model's backbone delays this drop until beyond 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
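To make the contrast between the two occlusion types concrete, the sketch below generates both patterns for a target occlusion fraction. This is an assumption about the general idea, not the paper's exact protocol; the patch size, zero-filling, and the overlap of scattered patches (which makes the realized fraction approximate) are illustrative choices.

```python
import numpy as np

def occlude(image, fraction, mode="scattered", patch=16, rng=None):
    """Zero out roughly `fraction` of an HxW(xC) image's area.

    mode="scattered": many small patch-by-patch squares at random positions
    (the fine-grained pattern the paper finds most harmful).
    mode="contiguous": a single square block of equivalent total area.
    """
    rng = np.random.default_rng(rng)
    out = image.copy()
    h, w = image.shape[:2]
    if mode == "contiguous":
        side = int(np.sqrt(fraction * h * w))
        top = int(rng.integers(0, h - side + 1))
        left = int(rng.integers(0, w - side + 1))
        out[top:top + side, left:left + side] = 0
    else:
        n_patches = int(fraction * h * w / patch ** 2)
        for _ in range(n_patches):  # patches may overlap, so the realized
            top = int(rng.integers(0, h - patch + 1))  # fraction is approximate
            left = int(rng.integers(0, w - patch + 1))
            out[top:top + patch, left:left + patch] = 0
    return out

# Example: 40% scattered vs. 40% contiguous occlusion on a dummy image.
img = np.ones((224, 224, 3), dtype=np.float32)
scattered = occlude(img, 0.4, mode="scattered", rng=0)
contiguous = occlude(img, 0.4, mode="contiguous", rng=0)
```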
Problem

Research questions and friction points this paper is trying to address.

Evaluating CLIP's robustness for military vehicle classification under occlusion (a zero-shot evaluation sketch follows this list)
Assessing performance degradation from fine-grained versus contiguous occlusions
Investigating model resilience through architectural variations and fine-tuning strategies
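For context, the zero-shot setup under evaluation pairs each image with text prompts and picks the nearest class in CLIP's shared embedding space. The sketch below uses the open-source open_clip library; the model variant, prompt template, class names, and input file are placeholders, since the paper's 18-class taxonomy and configuration are not listed on this page.

```python
import torch
import open_clip  # pip install open_clip_torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Placeholder class names; the paper's 18-class taxonomy is not listed here.
classes = ["main battle tank", "armored personnel carrier",
           "self-propelled howitzer"]
prompts = tokenizer([f"a photo of a {c}" for c in classes])

@torch.no_grad()
def classify(pil_image):
    """Zero-shot prediction: nearest class text embedding in CLIP space."""
    img = preprocess(pil_image).unsqueeze(0)
    img_feat = model.encode_image(img)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
    return classes[probs.argmax().item()]

print(classify(Image.open("vehicle.jpg")))  # hypothetical input file
```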
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based CLIP outperforms CNNs
Fine-tuning the backbone delays the performance drop to beyond 60% occlusion
Occlusion-specific augmentations improve robustness (a fine-tuning sketch follows this list)
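As a rough illustration of the last two points, the sketch below fine-tunes only CLIP's visual backbone with an occlusion-style augmentation, using open_clip and torchvision's RandomErasing as a simple stand-in for the paper's occlusion patterns. The model choice, learning rate, and erasing parameters are assumptions, not the authors' configuration.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Occlusion-aware augmentation: RandomErasing as a simple stand-in for
# the paper's scattered/contiguous occlusion patterns.
train_tf = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
    transforms.RandomErasing(p=0.9, scale=(0.05, 0.4)),
])

# Fine-tune only the visual backbone (not a linear probe): freeze all
# parameters, then unfreeze the image tower.
for p in model.parameters():
    p.requires_grad_(False)
for p in model.visual.parameters():
    p.requires_grad_(True)
optimizer = torch.optim.AdamW(model.visual.parameters(), lr=1e-5)

def training_step(images, text_tokens):
    """One CLIP-style contrastive step on a batch of (image, caption) pairs."""
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(text_tokens), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.T
    labels = torch.arange(len(images))  # matching pairs lie on the diagonal
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the text tower keeps the prompt embedding space fixed while the image tower adapts to occluded inputs, which preserves zero-shot behavior at inference.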
Authors
Jan Erik van Woerden
TNO, Oude Waalsdorperweg 63, 2597 AK, The Hague, The Netherlands
Gertjan Burghouts
Deep Learning for Vision
machine learning, deep learning, computer vision, artificial intelligence
Lotte Nijskens
TNO, Oude Waalsdorperweg 63, 2597 AK, The Hague, The Netherlands
Alma M. Liezenga
TNO, Oude Waalsdorperweg 63, 2597 AK, The Hague, The Netherlands
Sabina van Rooij
TNO
computer vision, deep learning
Frank Ruis
TNO
Computer Vision, Data Quality, Deep Learning
Hugo J. Kuijf
Senior Scientist, Intelligent Imaging, TNO
image processing, machine learning