Cross-Modal Safety Alignment: Is textual unlearning all you need?

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 21
Influential: 3
🤖 AI Summary
Multimodal large language models (e.g., VLMs) expose a novel attack surface where adversarial visual inputs bypass text-based safety mechanisms. Method: We propose a paradigm shift—performing *text-only unlearning* to achieve cross-modal safety alignment, eliminating reliance on multimodal safety retraining. Our approach exploits the language-space fusion architecture of modern VLMs and applies a text-side parameter unlearning algorithm that operates exclusively in the textual modality, without requiring any visual training data. Contribution/Results: Empirical evaluation across six benchmark datasets shows the method reduces attack success rates to below 8%, and in some cases to nearly 2%, while preserving original model utility. Compared to unlearning with multimodal data, it avoids up to 6× higher computational cost. This work is the first to demonstrate that pure text-space unlearning generalizes to cross-modal safety alignment—challenging and extending prevailing safety training paradigms.

📝 Abstract
Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability -- textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.
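The core idea above—unlearn harmful completions in the textual domain only, while preserving benign behavior—can be sketched on a toy next-token model. Everything below (the vocabulary, the softmax "model", the forget/retain pairs, and the update rule combining gradient ascent on the harmful target with gradient descent on the benign one) is a hypothetical illustration of the unlearning objective, not the paper's actual algorithm or implementation:

```python
import math

# Toy vocabulary; index 2 is the "harmful" continuation, index 3 the benign one.
VOCAB = ["how", "make", "bomb", "cake"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_nll(logits, target):
    # Gradient of negative log-likelihood w.r.t. logits: p - onehot(target).
    p = softmax(logits)
    return [p[i] - (1.0 if i == target else 0.0) for i in range(len(p))]

def unlearn_step(logits, forget_tok, retain_tok, lr=0.5):
    gf = grad_nll(logits, forget_tok)  # ascent: raise loss on harmful completion
    gr = grad_nll(logits, retain_tok)  # descent: keep loss low on benign completion
    return [w + lr * gf[i] - lr * gr[i] for i, w in enumerate(logits)]

# For the context "how make", the initial model prefers "bomb" over "cake".
logits = [0.0, 0.0, 2.0, 1.0]
for _ in range(50):
    logits = unlearn_step(logits, forget_tok=2, retain_tok=3)

probs = softmax(logits)  # harmful mass collapses, benign mass is preserved
```

After the updates, nearly all probability mass shifts to the benign continuation. The point the paper makes is that because a VLM fuses every modality into this same language space, suppressing the harmful text continuation here would also blunt vision-routed attacks that elicit the same textual output.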
Problem

Research questions and friction points this paper is trying to address.

Addressing safety vulnerabilities in multimodal LLMs through textual unlearning
Evaluating cross-modal attack mitigation without multimodal training data
Comparing efficiency of textual versus multimodal safety alignment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Textual unlearning enables cross-modal safety alignment
Unlearning transfers from text to multimodal attack resistance
Text-only unlearning reduces computational costs versus multimodal training
Trishna Chakraborty
Computer Science and Engineering, University of California, Riverside
Erfan Shayegani
Ph.D. student at University of California, Riverside
Natural Language Processing · Alignment · AI Safety · Vision and Language · Machine Learning
Zikui Cai
University of Maryland
Machine Learning · Trustworthy AI · Computer Vision · Robotics
Nael B. Abu-Ghazaleh
Computer Science and Engineering, University of California, Riverside
M. S. Asif
Electrical and Computer Engineering, University of California, Riverside
Yue Dong
University of California Riverside
Artificial Intelligence · Natural Language Processing · Machine Learning · LLM Security
A. Roy-Chowdhury
Electrical and Computer Engineering, University of California, Riverside
Chengyu Song
UC Riverside
Security · Operating Systems · Programming Languages · Trustworthy ML