SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) achieve strong performance in image recognition but suffer from poor adversarial robustness and severe adversarial overfitting. To address this, we propose SAFER, a layer-selective fine-tuning framework that identifies the most vulnerable layers via layer-wise sensitivity analysis and applies Sharpness-Aware Minimization (SAM) exclusively to their parameters while freezing all other layers. This sensitive-layer-focused approach improves clean accuracy and adversarial robustness simultaneously, without additional computational overhead. Evaluated across multiple ViT architectures (including ViT-Base and DeiT-S) and benchmark datasets (ImageNet-1k, CIFAR-10/100), SAFER typically improves accuracy by around 5%, with gains of up to 20% in some cases, and markedly mitigates adversarial overfitting.

📝 Abstract
Vision transformers (ViTs) have become essential backbones in advanced computer vision applications and multi-modal foundation models. Despite their strengths, ViTs remain vulnerable to adversarial perturbations, comparable to or even exceeding the vulnerability of convolutional neural networks (CNNs). Furthermore, the large parameter count and complex architecture of ViTs make them particularly prone to adversarial overfitting, often compromising both clean and adversarial accuracy. This paper mitigates adversarial overfitting in ViTs through a novel, layer-selective fine-tuning approach: SAFER. Instead of optimizing the entire model, we identify and selectively fine-tune a small subset of layers most susceptible to overfitting, applying sharpness-aware minimization to these layers while freezing the rest of the model. Our method consistently enhances both clean and adversarial accuracy over baseline approaches. Typical improvements are around 5%, with some cases achieving gains as high as 20% across various ViT architectures and datasets.
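The recipe the abstract describes — select the overfitting-prone layers, apply SAM updates only to their parameters, and freeze everything else — can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the function name `sam_step`, the parameter-group names, and the toy quadratic loss are all assumptions, and a real implementation would operate on ViT layer tensors with an autograd framework.

```python
import math

def sam_step(params, grad_fn, selected, rho=0.05, lr=0.1):
    """One Sharpness-Aware Minimization (SAM) update restricted to the
    `selected` parameter groups; every other group stays frozen.

    params   -- dict mapping group name -> scalar parameter (toy setting)
    grad_fn  -- function returning a dict of gradients for a params dict
    selected -- names of the sensitivity-chosen ("vulnerable") groups
    """
    # Pass 1: gradient at the current point, restricted to selected groups.
    grads = grad_fn(params)
    norm = math.sqrt(sum(grads[k] ** 2 for k in selected)) + 1e-12
    # Ascend to the approximate worst-case point in an L2 ball of radius rho.
    perturbed = dict(params)
    for k in selected:
        perturbed[k] = params[k] + rho * grads[k] / norm
    # Pass 2: the gradient at the perturbed point drives the actual update,
    # biasing the selected layers toward flat (sharpness-aware) minima.
    grads2 = grad_fn(perturbed)
    updated = dict(params)
    for k in selected:
        updated[k] = params[k] - lr * grads2[k]
    return updated

# Toy usage: quadratic loss sum(w**2), so grad = 2w. Only "block11_mlp"
# (a hypothetical sensitive layer) is fine-tuned; "patch_embed" is frozen.
params = {"patch_embed": 1.0, "block11_mlp": 1.0}
grad_fn = lambda p: {k: 2.0 * v for k, v in p.items()}
out = sam_step(params, grad_fn, selected=["block11_mlp"])
```

In this sketch the frozen group is returned unchanged while the selected group takes a SAM step toward the minimum, mirroring the paper's idea of fine-tuning only a small susceptible subset of layers.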
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
Image Recognition
Accuracy and Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAFER
Vision Transformers
Performance Enhancement
Bhavna Gopal
PhD student @ Duke University
Computer Vision · Neural Architecture Search · AI Safety and Privacy · Adversarial Robustness
Huanrui Yang
Assistant Professor, ECE, University of Arizona
Efficient deep learning · Trustworthy deep learning
Mark Horton
Department of Electrical and Computer Engineering, Duke University
Yiran Chen
Department of Electrical and Computer Engineering, Duke University