🤖 AI Summary
This work identifies a critical security vulnerability in multimodal diffusion models that arises from insufficient text–image alignment, a weakness that is especially dangerous for NSFW content generation. To address it, we propose PReMA, the first prompt-restricted multimodal adversarial attack framework, which manipulates text-conditioned image generation solely through adversarial input images, without altering the textual prompt. PReMA induces semantic misalignment in both inpainting and style-transfer tasks, causing models to generate content inconsistent with the prompt's intended semantics. Extensive experiments demonstrate its generalizability across tasks and architectures, covering mainstream models including Stable Diffusion. To our knowledge, this is the first systematic exposure of input-image-side vulnerabilities in text-guided generative modeling. Our work establishes a new paradigm for evaluating and improving alignment robustness in multimodal foundation models, and offers a foundation for developing defenses against modality-misalignment attacks.
📝 Abstract
Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigation reveals that the alignment between the textual and image modalities in existing diffusion models is inadequate. This misalignment poses significant risks, especially for the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack, the Prompt-Restricted Multi-modal Attack (PReMA), which manipulates the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs solely by crafting adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations on image inpainting and style transfer tasks across various models confirm the efficacy of PReMA.
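The abstract does not spell out PReMA's optimization procedure, but the general recipe it describes, perturbing only the input image within a small budget while the prompt stays fixed, can be sketched as a projected-gradient attack. The sketch below is illustrative only: the tiny linear "generator", the attacker's loss, and all hyperparameters (`eps`, `alpha`, `steps`) are stand-in assumptions, not the paper's method, which would backpropagate through a real diffusion pipeline instead.

```python
import numpy as np

def pgd_image_attack(x0, grad_fn, eps=8/255, alpha=2/255, steps=10):
    """Projected gradient ascent on the input image only (prompt fixed):
    maximize an attacker-chosen loss via grad_fn while keeping the
    perturbation inside an L-infinity ball of radius eps around x0."""
    x = x0.copy()
    for _ in range(steps):
        g = grad_fn(x)
        x = x + alpha * np.sign(g)          # ascend the attacker's loss
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the eps-ball
        x = np.clip(x, 0.0, 1.0)            # keep a valid image range
    return x

# Toy stand-in "generator": a fixed linear map. The attacker's loss is the
# squared distance between the generated output and the benign target;
# PGD drives it up while the visible change to the image stays within eps.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
target = rng.standard_normal(4)
loss = lambda x: 0.5 * np.sum((W @ x - target) ** 2)
grad = lambda x: W.T @ (W @ x - target)     # analytic gradient of the loss

x_clean = rng.uniform(0.2, 0.8, size=4)
x_adv = pgd_image_attack(x_clean, grad)
print("perturbation bounded:", np.max(np.abs(x_adv - x_clean)) <= 8/255 + 1e-9)
print("output pushed off target:", loss(x_adv) > loss(x_clean))
```

Against a real inpainting or style-transfer model the loss would instead measure distance to an attacker-chosen (e.g. NSFW) output in image or latent space, with gradients obtained by differentiating through the denoising process; the projection step is what keeps the adversarial image visually close to the clean one.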