🤖 AI Summary
Large Vision-Language Models (LVLMs) lack dedicated safety mechanisms for the visual modality; text-aligned safety strategies from LLMs do not naturally generalize to image inputs, rendering them vulnerable to adversarial or harmful images.
Method: We propose “security tensors”—trainable, cross-modal bridging vectors that, in response to visual inputs, activate safety layers within the language module at inference time without modifying model parameters. Security tensors are injected into either the text or vision pathway and optimized jointly on malicious image–text pairs, structurally similar benign contrastive samples, and general benign data, transferring textual safety knowledge to the visual modality.
Contribution/Results: Hidden-layer analysis confirms effective activation of safety mechanisms. Experiments demonstrate substantial improvements in rejection rates across diverse harmful images while preserving performance on benign vision-language tasks, achieving a balanced trade-off between safety and functionality.
📝 Abstract
Large vision-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors: trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) benign pairs whose text is structurally similar to malicious queries, serving as contrastive examples that guide the model to rely on visual content, and (iii) general benign samples that preserve model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further analysis of hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" for visual inputs, thereby effectively extending text-based safety to the visual modality.
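To make the training setup concrete, here is a minimal toy sketch of the core idea: only a small input-side vector is optimized, while the model itself stays frozen. Everything below is an illustrative assumption, not the paper's implementation: the "model" is a tiny linear refusal scorer, the security tensor enters through a simple input-dependent interaction term (standing in for a frozen transformer attending to prepended trainable vectors), and the data points are toy stand-ins for the three dataset components described in the abstract.

```python
import math

# Frozen "language module": a linear refusal scorer whose weights w, b are
# never updated (a toy stand-in for the aligned LLM). The trainable
# security tensor `s` influences the score only through an input-dependent
# interaction term, mimicking how a frozen transformer would attend to
# prepended trainable vectors.
w = [-1.0, -0.5]   # frozen weights: initially the model complies with everything
b = 0.0            # frozen bias

def refusal_prob(x, s):
    """Probability that the frozen scorer refuses input embedding x,
    given security tensor s."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z += sum(xi * si for xi, si in zip(x, s))  # interaction with the tensor
    return 1.0 / (1.0 + math.exp(-z))

# Toy curated dataset: (input embedding, target), target 1 = should refuse.
data = [
    ([0.2, 0.6], 1),   # malicious image-text pair: must be rejected
    ([0.1, 0.9], 1),   # malicious image-text pair
    ([1.5, 0.2], 0),   # structurally similar benign contrastive sample
    ([1.2, 0.1], 0),   # general benign sample: preserve functionality
]

def train_security_tensor(steps=2000, lr=0.5):
    s = [0.0, 0.0]  # the security tensor: the ONLY trainable parameters
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            p = refusal_prob(x, s)
            for i in range(len(s)):
                # d(binary cross-entropy)/ds_i = (p - y) * x_i here
                grad[i] += (p - y) * x[i]
        for i in range(len(s)):
            s[i] -= lr * grad[i] / len(data)
    return s
```

In the actual method, `s` would be trainable embedding vectors prepended in the text or vision pathway of a real LVLM and optimized with standard backpropagation, with all model weights kept frozen; the sketch only illustrates that training touches the input vector, never the model.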