A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models exhibit insufficient robustness under adversarial perturbations, limiting their practical deployment. This work proposes ET3, a lightweight, training-free test-time defense method that, for the first time, introduces an energy minimization mechanism into the test phase of vision-language models. By leveraging energy-guided input transformation, ET3 enhances model robustness while, under reasonable assumptions, providing theoretical guarantees for classification correctness. Experimental results demonstrate that ET3 significantly outperforms existing test-time defenses across multiple tasks—including image classification, image captioning, and visual question answering—effectively improving adversarial robustness without requiring retraining or architectural modifications.
📝 Abstract
Despite the rapid progress in multimodal models and Large Vision-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers and zero-shot classification with CLIP, and also boosts the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.
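The core idea, per the abstract, is to purify an input at test time by descending an energy function before classification. The paper's exact energy and update rule are not given in this summary, so the sketch below is a hypothetical minimal instance: it uses the common logsumexp-of-logits energy from energy-based classifiers, a toy linear classifier `W`, and plain gradient descent on the input (`et3_transform` is an illustrative name, not the authors' API).

```python
import numpy as np

def energy(x, W):
    # Hypothetical energy choice: E(x) = -logsumexp(logits), as in
    # energy-based classifiers (computed stably via the max trick).
    z = W @ x
    m = z.max()
    return -(m + np.log(np.sum(np.exp(z - m))))

def energy_grad(x, W):
    # Analytic gradient: dE/dx = -W^T softmax(W x)
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    return -W.T @ p

def et3_transform(x, W, steps=20, lr=0.1):
    """Energy-guided test-time transformation (sketch):
    gradient-descend the energy w.r.t. the input before classifying."""
    x = x.copy()
    for _ in range(steps):
        x -= lr * energy_grad(x, W)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))    # toy linear "classifier"
x_adv = rng.normal(size=8)     # stand-in for a perturbed input
x_out = et3_transform(x_adv, W)
assert energy(x_out, W) < energy(x_adv, W)  # energy decreases
```

In practice the defense would wrap a pretrained vision encoder rather than a linear map, and the perturbation budget of the descent would be constrained; those details are beyond what this summary specifies.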
Problem

Research questions and friction points this paper is trying to address.

adversarial robustness
Large Vision-Language Models
adversarial perturbations
test-time defense
Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy-Guided
Test-Time Transformation
Adversarial Robustness
Large Vision-Language Models
Training-Free Defense
Mujtaba Hussain Mirza
OmnAI Lab, Computer Science Department, Sapienza University of Rome, Italy
Antonio D'Orazio
OmnAI Lab, Computer Science Department, Sapienza University of Rome, Italy
Odelia Melamed
Weizmann Institute of Science, Israel
Iacopo Masi
Sapienza University of Rome
Computer Vision