🤖 AI Summary
Safety-critical vision systems are vulnerable to multiple localized perturbations, whether adversarial patches or natural artifacts such as occlusions and scanner noise in medical imaging, yet existing defenses are largely limited to single-patch attacks. To address this, we propose Filtered-ViT, a Vision Transformer architecture that integrates a SMART Vector Median Filtering (SMART-VMF) module. The module combines spatial adaptivity, multi-scale processing, and robustness-aware filtering to selectively suppress interference across multiple regions in feature space while preserving semantic detail. It is presented as the first transformer to demonstrate unified robustness against both adversarial and naturally occurring patch-like corruptions. On ImageNet under LaVAN attacks with four simultaneous 1%-area adversarial patches, Filtered-ViT achieves 46.3% robust accuracy (vs. 79.8% clean accuracy). A case study on real-world radiographic images further shows that it mitigates noise and artifacts without degrading diagnostic content.
📝 Abstract
Deep learning vision systems are increasingly deployed in safety-critical domains such as healthcare, yet they remain vulnerable to small adversarial patches that can trigger misclassifications. Most existing defenses assume a single patch and fail when multiple localized disruptions occur, which is the scenario that adversaries exploit and that real-world artifacts often produce. We propose Filtered-ViT, a new vision transformer architecture that integrates SMART Vector Median Filtering (SMART-VMF), a spatially adaptive, multi-scale, robustness-aware mechanism that selectively suppresses corrupted regions while preserving semantic detail. On ImageNet with LaVAN multi-patch attacks, Filtered-ViT achieves 79.8% clean accuracy and 46.3% robust accuracy under four simultaneous 1% patches, outperforming existing defenses. Beyond synthetic benchmarks, a real-world case study on radiographic medical imagery shows that Filtered-ViT mitigates natural artifacts such as occlusions and scanner noise without degrading diagnostic content. This establishes Filtered-ViT as the first transformer to demonstrate unified robustness against both adversarial and naturally occurring patch-like disruptions, charting a path toward reliable vision systems in high-stakes environments.
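The abstract does not specify how SMART-VMF works internally, so the following is only a minimal sketch of the classical vector median filter that the name builds on, not the authors' module. In each window the filter outputs the member vector whose summed distance to all other members is smallest, which lets a few outlier vectors be replaced by a representative neighbor. The function names, the fixed 3x3 window, the L2 distance, and the edge padding are all assumptions made for illustration; the spatially adaptive, multi-scale, and robustness-aware parts of SMART-VMF are not modeled here.

```python
import numpy as np


def vector_median(vectors: np.ndarray) -> np.ndarray:
    """Return the row of `vectors` (shape (n, d)) whose summed L2 distance
    to all other rows is smallest (the classical vector median)."""
    diffs = vectors[:, None, :] - vectors[None, :, :]   # (n, n, d) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)              # (n, n) pairwise L2 distances
    return vectors[np.argmin(dists.sum(axis=1))]


def vector_median_filter(grid: np.ndarray, k: int = 3) -> np.ndarray:
    """Apply a k x k vector median filter to an (H, W, D) grid of vectors.

    Hypothetical helper for illustration: borders are handled by edge
    padding and the window size is fixed, unlike an adaptive scheme.
    """
    pad = k // 2
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(grid)
    height, width, dim = grid.shape
    for i in range(height):
        for j in range(width):
            window = padded[i:i + k, j:j + k].reshape(-1, dim)
            out[i, j] = vector_median(window)
    return out


# Toy usage: a smooth grid with one corrupted vector standing in for a patch.
grid = np.zeros((5, 5, 3))
grid[2, 2] = 10.0
filtered = vector_median_filter(grid)
```

In a transformer setting, `grid` could be the patch-token embeddings reshaped to their spatial layout, though where Filtered-ViT actually applies its filtering is not stated in the abstract. Because the output is always one of the input vectors, the filter does not blend a corrupted vector into its neighbors the way a mean filter would.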