Vision Transformers: the threat of realistic adversarial patches

πŸ“… 2025-09-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work exposes the severe vulnerability of Vision Transformers (ViTs) to adversarial patch attacks under realistic conditions. For person vs. non-person classification, the authors apply the Creases Transformation (CT), a geometric distortion modeling technique that produces photorealistic, wearable adversarial patches. They systematically evaluate the cross-architecture transferability of CNN-derived adversarial patches to ViTs, conducting both black-box and white-box attacks across multiple fine-tuned ViT variants. Results demonstrate strong transferability, with adversarial patches achieving up to a 99.97% attack success rate; pre-training data scale and strategy also significantly influence ViT robustness. This study provides the first empirical evidence of ViTs' security weaknesses under realistic physical deformations, offering critical insights for deploying robust vision models. It establishes foundational empirical benchmarks and opens new research directions for adversarial defenses tailored to transformer-based architectures.

πŸ“ Abstract
The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to 1) increased performance compared to Convolutional Neural Networks (CNNs) and 2) greater robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches: distinctive patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
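The attack success rates quoted above can be read as the fraction of patched inputs whose predicted label flips away from the clean prediction. A minimal sketch of that metric (with hypothetical toy labels, not the paper's evaluation pipeline):

```python
def attack_success_rate(clean_preds, patched_preds):
    """Percentage of samples whose prediction changed after patching."""
    flipped = sum(c != p for c, p in zip(clean_preds, patched_preds))
    return 100.0 * flipped / len(clean_preds)

# Toy example: 3 of 4 predictions flip under the patch -> 75% ASR.
print(attack_success_rate([1, 1, 0, 1], [0, 0, 0, 0]))  # 75.0
```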
Problem

Research questions and friction points this paper is trying to address.

Investigating Vision Transformers' vulnerability to realistic adversarial patches
Assessing transferability of CNN adversarial attacks to ViT classification models
Evaluating how pre-training affects ViT resilience against evasion attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Creases Transformation for realistic adversarial patches
Tests patch transferability from CNNs to Vision Transformers
Evaluates four fine-tuned ViT models on binary classification
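The core idea of the Creases Transformation is to warp a patch with clothing-like geometric distortions before (or while) attacking, so the optimized pattern survives physical deformation. The paper does not publish its exact transform here, so the sketch below uses an assumed stand-in: smooth sinusoidal per-row displacements that loosely mimic fabric folds, followed by pasting the warped patch onto an image.

```python
import numpy as np

def crease_transform(patch, amplitude=2.0, frequency=0.15, seed=0):
    """Warp a patch with sinusoidal per-row horizontal shifts that
    loosely mimic fabric creases (a stand-in for the paper's CT)."""
    rng = np.random.default_rng(seed)
    h, w, c = patch.shape
    phase = rng.uniform(0, 2 * np.pi)
    # Per-row shift amount follows a sine wave -> gentle fold pattern.
    shifts = (amplitude * np.sin(frequency * np.arange(h) + phase)).astype(int)
    warped = np.empty_like(patch)
    for y in range(h):
        warped[y] = np.roll(patch[y], shifts[y], axis=0)  # shift along width
    return warped

def paste_patch(image, patch, top, left):
    """Return a copy of the image with the patch overlaid at (top, left)."""
    h, w, _ = patch.shape
    out = image.copy()
    out[top:top + h, left:left + w] = patch
    return out

# Example: warp a 32x32 patch and paste it onto a 224x224 image.
image = np.zeros((224, 224, 3), dtype=np.float32)
patch = np.ones((32, 32, 3), dtype=np.float32) * 0.5
attacked = paste_patch(image, crease_transform(patch), top=96, left=96)
```

In a full attack loop, the warp parameters would be randomized each iteration (expectation-over-transformation style) so the patch's adversarial effect is robust to the deformations it will meet when worn.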
πŸ‘₯ Authors
Kasper Cools
Belgian Royal Military Academy, Brussels, Belgium
Clara Maathuis
Open University of the Netherlands, Heerlen, the Netherlands
Alexander M. van Oers
Netherlands Defence Academy, Den Helder, the Netherlands
Claudia S. HΓΌbner
Fraunhofer Institute of Optronics, Ettlingen, Germany
Nikos Deligiannis
Vrije Universiteit Brussel, imec
Signal Processing Β· Machine Learning Β· Computer Vision Β· Explainable AI
Marijke Vandewal
Belgian Royal Military Academy, Brussels, Belgium
Geert De Cubber
Royal Military Academy
Robotics Β· Computer Vision Β· Artificial Intelligence Β· Search and Rescue Β· Drone Detection