MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

πŸ“… 2026-03-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a critical gap in the safety evaluation of large language models (LLMs), which has focused predominantly on text-only inputs and lacks systematic assessment of how alignment generalizes under multimodal conditions. We propose the first open-source, execution-centric unified safety evaluation platform for multimodal LLMs, integrating cross-modal adversarial payload generation, multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy, all accessible through a browser-based interface. A key innovation is "Inter-Turn Modality Switching" (ITMS), enabling the first systematic investigation of how modality transitions across conversational turns affect safety alignment. We further design dual soft/hard metrics to capture partial-compliance risks. Experiments across six multimodal LLMs from four vendors show that multi-turn attacks can raise attack success rates to 90–100% against models that refuse nearly all single-turn attempts, with the direction of modality effects being model-family-specific rather than universal.
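To make the ITMS idea concrete, here is a minimal sketch of per-turn modality rotation. It assumes a caller-supplied `send_turn(prompt, modality)` callback that renders each prompt in the target modality and returns the model's reply; all names here are illustrative, not MUSE's actual API.

```python
from itertools import cycle
from typing import Callable

# Modalities to rotate through across conversational turns.
MODALITIES = ["text", "image", "audio", "video"]

def run_itms_attack(turns: list[str],
                    send_turn: Callable[[str, str], str],
                    rotation: list[str] = MODALITIES) -> list[str]:
    """Deliver a multi-turn attack, switching the input modality each turn.

    `send_turn(prompt, modality)` is a hypothetical callback that renders
    the prompt in the given modality (e.g. text typeset into an image, or
    synthesized to speech) and returns the target model's reply.
    """
    replies = []
    for prompt, modality in zip(turns, cycle(rotation)):
        replies.append(send_turn(prompt, modality))
    return replies
```

A fixed rotation order is only one possibility; the paper's ablation suggests that which modalities strengthen or weaken an attack varies by model family rather than following a universal pattern.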

πŸ“ Abstract
Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (ASR; Compliance only) from soft ASR (which also counts Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve 90–100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but it accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
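As a concrete reading of the dual-metric framework, the sketch below computes hard and soft ASR from per-run judge labels. Only "Compliance" and "Partial Compliance" are named in the abstract; the "Refusal" label in the example is a stand-in for the taxonomy's remaining levels.

```python
from collections import Counter

def dual_asr(judgments: list[str]) -> dict[str, float]:
    """Hard ASR counts full Compliance only; soft ASR also counts
    Partial Compliance, capturing partial information leakage that
    a binary success metric would miss."""
    counts = Counter(judgments)
    n = len(judgments)
    hard = counts["Compliance"] / n
    soft = (counts["Compliance"] + counts["Partial Compliance"]) / n
    return {"hard_asr": hard, "soft_asr": soft}

# Ten hypothetical attack runs scored by the LLM judge:
labels = ["Compliance"] * 3 + ["Partial Compliance"] * 2 + ["Refusal"] * 5
print(dual_asr(labels))  # {'hard_asr': 0.3, 'soft_asr': 0.5}
```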
Problem

Research questions and friction points this paper is trying to address.

multimodal safety evaluation
large language models
red-teaming
cross-modal alignment
modality generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Safety Evaluation
Inter-Turn Modality Switching
Cross-Modal Red-Teaming
Multi-turn Attack Algorithms
Run-Centric Platform
πŸ”Ž Similar Papers
No similar papers found.