🤖 AI Summary
Large Vision-Language Models (LVLMs) lack dedicated safety mechanisms for the visual modality; text-aligned safety strategies from LLMs do not naturally generalize to image inputs, rendering them vulnerable to adversarial or harmful images.
Method: We propose “security tensors”—trainable, cross-modal bridging vectors that, in response to visual inputs, activate safety layers within the language module at inference time without modifying model parameters. Security tensors are injected into either the text or vision pathway and optimized jointly on malicious image–text pairs, structurally similar benign contrastive samples, and general benign data, transferring textual safety knowledge to the visual modality.
Contribution/Results: Hidden-layer analysis confirms effective activation of safety mechanisms. Experiments demonstrate substantial improvements in rejection rates across diverse harmful images while preserving performance on benign vision-language tasks, achieving a balanced trade-off between safety and functionality.
📝 Abstract
Large vision-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors: trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) benign pairs whose text is structurally similar to malicious queries, serving as contrastive examples that guide the model to rely on visual content, and (iii) general benign samples that preserve model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further analysis of hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" for visual inputs, thereby effectively extending text-based safety to the visual modality.
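To make the training setup concrete, here is a minimal toy sketch of the core idea: only a small input-side vector is optimized, while the model itself stays frozen. Everything below is an illustrative assumption, not the paper's implementation: the "model" is a tiny linear refusal scorer, the security tensor enters through a simple input-dependent interaction term (standing in for a frozen transformer attending to prepended trainable vectors), and the data points are toy stand-ins for the three dataset components described in the abstract.

```python
import math

# Frozen "language module": a linear refusal scorer whose weights w, b are
# never updated (a toy stand-in for the aligned LLM). The trainable
# security tensor `s` influences the score only through an input-dependent
# interaction term, mimicking how a frozen transformer would attend to
# prepended trainable vectors.
w = [-1.0, -0.5]   # frozen weights: initially the model complies with everything
b = 0.0            # frozen bias

def refusal_prob(x, s):
    """Probability that the frozen scorer refuses input embedding x,
    given security tensor s."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z += sum(xi * si for xi, si in zip(x, s))  # interaction with the tensor
    return 1.0 / (1.0 + math.exp(-z))

# Toy curated dataset: (input embedding, target), target 1 = should refuse.
data = [
    ([0.2, 0.6], 1),   # malicious image-text pair: must be rejected
    ([0.1, 0.9], 1),   # malicious image-text pair
    ([1.5, 0.2], 0),   # structurally similar benign contrastive sample
    ([1.2, 0.1], 0),   # general benign sample: preserve functionality
]

def train_security_tensor(steps=2000, lr=0.5):
    s = [0.0, 0.0]  # the security tensor: the ONLY trainable parameters
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            p = refusal_prob(x, s)
            for i in range(len(s)):
                # d(binary cross-entropy)/ds_i = (p - y) * x_i here
                grad[i] += (p - y) * x[i]
        for i in range(len(s)):
            s[i] -= lr * grad[i] / len(data)
    return s
```

In the actual method, `s` would be trainable embedding vectors prepended in the text or vision pathway of a real LVLM and optimized with standard backpropagation, with all model weights kept frozen; the sketch only illustrates that training touches the input vector, never the model.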