🤖 AI Summary
Vision Transformers, and ShiftViT in particular, lack provable ℓ₂-robustness because high-dimensional self-attention and large input dimensions yield unbounded Lipschitz constants.
Method: We propose the first Lipschitz-continuous variant of ShiftViT by deeply integrating Lipschitz constraints into its lightweight shift operations. We derive a tight theoretical upper bound on the global Lipschitz constant and enable efficient training via Lipschitz-margin loss, layer-wise weight clipping, and ℓ₂-norm-constrained optimization.
Contribution/Results: Our method significantly improves certified ℓ₂-robust accuracy on standard image classification benchmarks (e.g., CIFAR-10/100, ImageNet-1k), scales effectively to larger models, and establishes a new state of the art in certified robustness for Transformer architectures under ℓ₂ perturbations.
📝 Abstract
Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. Large input sizes and high-dimensional attention modules typically become crucial bottlenecks during training and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting the weights in successive layers of the model. Focusing on a Lipschitz-continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under a norm-constrained input setting. We provide an upper-bound estimate of the Lipschitz constants of this model under the $\ell_2$ norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state of the art in certified robustness for transformer-based architectures.
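To make the two training ingredients named in the summary concrete, here is a minimal NumPy sketch, not the paper's implementation: layer-wise weight clipping via a power-iteration estimate of the spectral ($\ell_2$-operator) norm, and a Tsuzuku-style Lipschitz-margin loss that inflates non-true logits by the worst-case shift $\sqrt{2}\,L\,\varepsilon$. All function names and the particular clipping scheme are illustrative assumptions.

```python
import numpy as np

def spectral_norm(W, n_iter=20):
    # Power-iteration estimate of the largest singular value of W,
    # i.e. the l2 Lipschitz constant of the linear map x -> W x.
    v = np.random.default_rng(0).normal(size=W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def clip_layer(W, max_norm=1.0):
    # Layer-wise weight clipping (illustrative): rescale W so its spectral
    # norm is at most max_norm, bounding the per-layer Lipschitz constant.
    s = spectral_norm(W)
    return W if s <= max_norm else W * (max_norm / s)

def lipschitz_margin_loss(logits, label, lipschitz_const, eps):
    # Lipschitz-margin loss in the style of Tsuzuku et al. (2018):
    # add the worst-case logit shift sqrt(2)*L*eps to every wrong class,
    # then take standard cross-entropy on the inflated logits.
    margin = np.sqrt(2.0) * lipschitz_const * eps
    shifted = logits + margin
    shifted[label] = logits[label]  # true-class logit is left unshifted
    z = shifted - shifted.max()     # numerically stable log-softmax
    return float(np.log(np.exp(z).sum()) - z[label])
```

With `lipschitz_const = 0` or `eps = 0` the loss reduces to plain cross-entropy, so the margin term can be read as a robustness penalty that grows with the certified radius one wants to achieve.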