OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
This work addresses the safety vulnerabilities of multimodal large language models (MLLMs), which existing red-teaming approaches evaluate inadequately because they are fragmented, limited to single-turn text interactions, and hard to scale. To overcome these limitations, the authors introduce OpenRT, a unified, modular, and high-throughput red-teaming framework for systematic MLLM safety evaluation. The framework separates five dimensions (model integration, dataset management, attack strategies, judging methods, and evaluation metrics), which enables scalable, automated, multi-turn, and cross-modal adversarial testing. Its core is an adversarial kernel that standardizes attack interfaces and thereby decouples red-teaming logic from a high-throughput asynchronous runtime. OpenRT integrates 37 attack methods and is evaluated on 20 advanced models; leading models show average attack success rates as high as 49.14%, and reasoning models are not inherently more robust against complex, multi-turn jailbreaks. The framework is released as open source and is intended as a continuously maintained evaluation infrastructure.

📝 Abstract
The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.
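
The separation the abstract describes (attack logic behind a standard interface, executed by an asynchronous runtime, scored by a judge, summarized as an attack success rate) can be illustrated with a small sketch. This is not OpenRT's actual API: every name below (`TargetModel`, `Attack`, `Judge`, `EchoModel`, `DirectAttack`, `PoliteAttack`, `PrefixJudge`, `evaluate`, `run_demo`) is hypothetical, the model is a toy stand-in rather than a real endpoint, and the "attacks" are deliberately benign string transformations.

```python
import asyncio
from typing import Protocol


# Hypothetical interfaces, assumed for illustration only (not OpenRT's API).
class TargetModel(Protocol):
    async def generate(self, prompt: str) -> str: ...


class Attack(Protocol):
    async def run(self, model: TargetModel, goal: str) -> str: ...


class Judge(Protocol):
    def is_success(self, goal: str, response: str) -> bool: ...


class EchoModel:
    """Toy model: complies only if the prompt contains the word 'please'."""

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        return "OK: " + prompt if "please" in prompt else "REFUSED"


class DirectAttack:
    """Sends the goal unchanged."""

    async def run(self, model: TargetModel, goal: str) -> str:
        return await model.generate(goal)


class PoliteAttack:
    """Rewrites the goal before sending it (a stand-in for a real strategy)."""

    async def run(self, model: TargetModel, goal: str) -> str:
        return await model.generate("please " + goal)


class PrefixJudge:
    """Counts a response as a success if the toy model complied."""

    def is_success(self, goal: str, response: str) -> bool:
        return response.startswith("OK")


async def evaluate(model, attacks, judge, goals, max_concurrency=8):
    """Run every attack on every goal concurrently; return success rate per attack."""
    sem = asyncio.Semaphore(max_concurrency)  # bounds in-flight requests

    async def one(attack, goal):
        async with sem:
            response = await attack.run(model, goal)
        return judge.is_success(goal, response)

    rates = {}
    for name, attack in attacks.items():
        outcomes = await asyncio.gather(*(one(attack, g) for g in goals))
        rates[name] = sum(outcomes) / len(outcomes)  # attack success rate
    return rates


def run_demo():
    goals = ["goal A", "goal B", "goal C", "goal D"]
    attacks = {"direct": DirectAttack(), "polite": PoliteAttack()}
    return asyncio.run(evaluate(EchoModel(), attacks, PrefixJudge(), goals))
```

Because the runtime in `evaluate` depends only on the three interfaces, a new attack, model wrapper, or judge can be swapped in without touching the scheduling code, which is the kind of decoupling the abstract attributes to the adversarial kernel.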

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
red-teaming
safety evaluation
adversarial attacks
AI safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

red-teaming framework
multimodal LLMs
adversarial kernel
modular architecture
high-throughput evaluation

👥 Authors
Xin Wang, Shanghai Artificial Intelligence Laboratory
Yunhao Chen, Fudan University (Audio; Diffusion Models; Memorization)
Juncheng Li, East China Normal University (Super Resolution; Image Restoration; Computer Vision; Medical Image Analysis)
Yixu Wang, Shanghai Artificial Intelligence Laboratory
Yang Yao, Shanghai Artificial Intelligence Laboratory
Tianle Gu, Tsinghua University ((M)LLM Safety; PEFT)
Jie Li, Shanghai Artificial Intelligence Laboratory
Yan Teng, Shanghai Artificial Intelligence Laboratory
Xingjun Ma, Fudan University (Trustworthy AI; Multimodal AI; Generative AI; Embodied AI)
Yingchun Wang, Shanghai Artificial Intelligence Laboratory
Xia Hu, Google DeepMind (Deep Learning; Machine Learning; Multimodal)