NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models

📅 2025-06-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) such as CLIP are highly vulnerable to adversarial attacks in the image modality. To address this, we propose an architecture-aware multimodal prompt tuning framework, the Neural Augmentor, which jointly optimizes cross-modal (image-text) prompts, injects layered prompt embeddings across transformer depths, and purifies feature representations in latent space. We further introduce a token refiner with residual connections to enable modality- and layer-adaptive feature restoration. Our method integrates adversarial prompt tuning with residual feature reconstruction, achieving substantial robustness gains under AutoAttack: robust accuracy improves by 33.5% and 33.0% over the strongest baseline on ViT-B/16 and ViT-B/32 backbones, respectively, while clean-sample accuracy remains competitive. This work is the first to unify multi-layer architectural awareness, cross-modal prompt coordination, and feature-space purification within a single prompt-tuning paradigm, establishing a new direction for enhancing VLM robustness.

📝 Abstract
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). Our key innovations include: (1) extending AdvPT from text-only to multi-modal prompting across both text and visual modalities, (2) expanding from single-layer to multi-layer prompt architectures, and (3) proposing a novel architecture-level redesign through our Neural Augmentor approach, which implements feature purification to directly address the distortions introduced by adversarial attacks in feature space. Our NAP-Tuning approach incorporates token refiners that learn to reconstruct purified features through residual connections, allowing for modality-specific and layer-specific feature correction. Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B/16 and 33.0% on ViT-B/32 architectures while maintaining competitive clean accuracy.
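The residual feature-purification idea behind the token refiners can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the name `TokenRefiner`, the plain linear correction map, and the dictionary keyed by (modality, layer) are all assumptions standing in for the learned, modality- and layer-specific refiner modules the abstract describes.

```python
import random

class TokenRefiner:
    """Toy per-modality, per-layer refiner (illustrative only): a small
    linear map whose output is added back via a residual connection."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Small random weights, so the residual branch starts near identity;
        # in the real method these would be learned during adversarial tuning.
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
                  for _ in range(dim)]

    def __call__(self, feat):
        # Linear correction term: delta = W @ feat
        delta = [sum(w_ij * f for w_ij, f in zip(row, feat))
                 for row in self.w]
        # Residual connection: purified = feat + delta
        return [f + d for f, d in zip(feat, delta)]

# One refiner per (modality, layer) pair gives modality- and
# layer-specific feature correction.
refiners = {("image", layer): TokenRefiner(dim=4, seed=layer)
            for layer in range(2)}

feat = [1.0, -2.0, 0.5, 3.0]           # toy adversarially perturbed feature
purified = refiners[("image", 0)](feat)
```

The residual form means the refiner only has to learn the (small) adversarial distortion rather than reproduce the whole feature, which is why the correction branch can stay lightweight.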
Problem

Research questions and friction points this paper is trying to address.

Enhancing adversarial robustness in Vision-Language Models
Extending prompt tuning to multi-modal text and visual inputs
Purifying adversarial distortions in feature space via Neural Augmentor
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal prompting across text and visual modalities
Multi-layer prompt architectures for enhanced robustness
Neural Augmentor with feature purification for adversarial distortions
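The multi-layer, multi-modal prompting listed above can be sketched as deep prompt injection: learnable prompt tokens are prepended to the token sequence at every transformer depth, with a separate prompt set per layer and modality. The token names, `PROMPT_LEN`, and the replace-previous-layer's-prompts convention below are illustrative assumptions, not the paper's exact scheme.

```python
# Deep prompt injection sketch: each layer swaps in its own learnable
# prompt tokens before the (elided) transformer layer processes the sequence.
NUM_LAYERS, PROMPT_LEN = 3, 2

# Hypothetical learnable prompts, one set per (modality, layer).
prompts = {("text", l): [f"p_text_{l}_{i}" for i in range(PROMPT_LEN)]
           for l in range(NUM_LAYERS)}

def forward(tokens, modality="text"):
    seq = list(tokens)
    for l in range(NUM_LAYERS):
        if l > 0:
            # Drop the prompts injected at the previous layer...
            seq = seq[PROMPT_LEN:]
        # ...and prepend this layer's prompts.
        seq = prompts[(modality, l)] + seq
        # ... a transformer layer would process `seq` here ...
    return seq

out = forward(["tok_a", "tok_b"])
# The final sequence carries the deepest layer's prompts plus the inputs.
```

A parallel `("image", l)` prompt table would give the visual branch its own per-layer prompts, which is what distinguishes this multi-modal, multi-layer design from single-layer, text-only AdvPT.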