Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of vision-language models (VLMs) to adversarial attacks, along with the limited generalization and high deployment cost of existing defenses. The authors propose SAEgis, a framework that inserts sparse autoencoders (SAEs) as plug-and-play modules into pretrained VLMs. The SAE is trained with a standard reconstruction objective, and its sparse latent features are then used to classify whether an input image has been adversarially perturbed. Because the approach requires neither modifications to the backbone model nor adversarial training, it adds minimal computational overhead. Experiments show strong performance in in-domain, cross-domain, and cross-attack settings, with the largest gains over existing baselines in cross-domain generalization.

📝 Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAEs as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
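
The pipeline described in the abstract (encode intermediate activations with an SAE, train on reconstruction, classify from the sparse latents) can be illustrated with a minimal sketch. This is not the SAEgis implementation: the ReLU encoder, the L1 sparsity penalty, mean pooling over tokens, the logistic probe, and every name and dimension below are assumptions chosen for illustration, since the abstract does not specify them. The weights are random and untrained, so the output only demonstrates the data flow.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Tiny ReLU SAE (assumed form, not taken from the paper):
    z = relu(W_enc (x - b_dec) + b_enc),  x_hat = W_dec z + b_dec."""

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)
        self.W_enc = [[rng.gauss(0, scale) for _ in range(d_model)]
                      for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        self.W_dec = [[rng.gauss(0, scale) for _ in range(d_latent)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = matvec(self.W_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, z):
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction MSE plus an L1 sparsity penalty (assumed objective)."""
        z = self.encode(x)
        x_hat = self.decode(z)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(z)  # z >= 0, so sum(z) is its L1 norm


def pooled_latents(sae, token_activations):
    """Mean-pool SAE latents over the tokens of one image (assumed pooling)."""
    codes = [sae.encode(x) for x in token_activations]
    return [sum(col) / len(codes) for col in zip(*codes)]


def attack_probability(features, probe_w, probe_b):
    """Hypothetical logistic probe; its weights would be fit on labeled
    clean and perturbed examples, which is not shown here."""
    logit = sum(w * f for w, f in zip(probe_w, features)) + probe_b
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in for activations captured at one VLM layer: 4 tokens, width 8.
rng = random.Random(1)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

sae = SparseAutoencoder(d_model=8, d_latent=32)
feats = pooled_latents(sae, tokens)
prob = attack_probability(feats, probe_w=[0.0] * 32, probe_b=0.0)
```

In this sketch the backbone is never modified: the SAE only reads activations, which is what makes the module plug-and-play. Under the same assumptions, the multi-layer variant mentioned in the abstract could be approximated by concatenating the pooled latents from SAEs attached to several layers before applying the probe.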

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Adversarial Attack Detection

Cross-Domain Generalization

Model Safety

Sparse Autoencoders

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders

Adversarial Attack Detection

Vision-Language Models