Robustifying Vision-Language Models via Dynamic Token Reweighting

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) are vulnerable to vision-text co-located jailbreaking attacks; existing defenses often rely on safety-labeled data or image-to-text modules, limiting generalizability and inference efficiency. To address this, we propose Dynamic Token Reweighting (DTR), an inference-time defense that requires no additional training, safety annotations, or cross-modal translation. DTR is the first method to leverage KV cache optimization for multimodal security: it dynamically models distributional shifts induced by visual inputs and reweights visual token importance in real time, thereby enhancing robustness at the vision–text interaction level. Evaluated across multiple VLM architectures and jailbreaking benchmarks, DTR significantly improves adversarial resistance while preserving performance on benign tasks and incurring minimal computational overhead—outperforming state-of-the-art defenses in overall effectiveness.

📝 Abstract
Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks by optimizing the model's key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model's general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available at: https://anonymous.4open.science/r/DTR-2755 (warning: this paper contains potentially harmful content generated by VLMs).
Problem

Research questions and friction points this paper is trying to address.

Mitigates multimodal jailbreak attacks in VLMs
Optimizes KV caches without safety-specific data
Dynamically adjusts visual token weights for safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token reweighting for robustness
KV cache optimization for safety
Formulating safety-relevant distributional shift
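To make the core idea concrete, here is a minimal single-head sketch of KV-cache-based visual token reweighting. All names (`dynamic_token_reweighting`, the ablation-based shift score, the threshold `tau`) are illustrative assumptions, not the paper's actual formulation: it uses leave-one-out attention-output shift as a crude stand-in for the safety-relevant distributional shift, and shrinks the cached values of visual tokens whose shift is large.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_token_reweighting(keys, values, query, visual_idx, tau=0.5):
    """Toy sketch (not the paper's exact method): downweight cached values
    of visual tokens whose removal most shifts the attention output.

    keys, values : (seq_len, d) arrays standing in for one head's KV cache
    query        : (d,) query vector for the current decoding step
    visual_idx   : positions of visual tokens in the cached sequence
    tau          : hypothetical shift threshold above which a token is shrunk
    """
    d = keys.shape[1]
    logits = query @ keys.T / np.sqrt(d)
    base = softmax(logits) @ values            # unmodified attention output
    weights = np.ones(len(visual_idx))
    for j, i in enumerate(visual_idx):
        mask = np.ones(len(logits), dtype=bool)
        mask[i] = False                        # ablate one visual token
        ablated = softmax(logits[mask]) @ values[mask]
        shift = np.linalg.norm(base - ablated) # proxy for distributional shift
        if shift > tau:                        # token dominates the output
            weights[j] = tau / shift           # shrink its cached value
    new_values = values.copy()
    new_values[visual_idx] *= weights[:, None] # reweight in the KV cache
    return new_values, weights
```

Because only the cached values are rescaled, the adjustment needs no retraining and adds a single pass over the visual tokens per step, which matches the summary's claim of minimal inference overhead.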