Ro-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic robustness evaluation under manipulated video content. Method: This work introduces Ro-Bench, the first dynamic out-of-distribution (OOD) counterfactual video benchmark for evaluating MLLM robustness. Leveraging text-driven counterfactual video generation, it synthesizes high-quality, diverse videos by editing style, objects, and backgrounds, and constructs a test set via hybrid human annotation and automated pipelines. Contribution/Results: Experiments across eight state-of-the-art video MLLMs reveal substantial performance degradation under counterfactual scenarios. Fine-tuning on Ro-Bench improves in-benchmark accuracy by 21.73% and yields an average 12.78% gain across 20 tasks on MVBench. This work provides the first systematic characterization of robustness bottlenecks in MLLM-based video understanding and establishes a scalable framework for both evaluation and robustness enhancement.

📝 Abstract
Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse, and temporally relevant video data by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLM robustness against manipulated video content
Assessing performance degradation with counterfactual video test sets
Improving video understanding through counterfactual data fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Creates counterfactual videos by editing Style, Object, Background
Evaluates MLLM robustness using dynamic out-of-distribution test sets
Fine-tunes MLLMs with counterfactual data to enhance robustness
Zixi Yang
Beijing University of Posts and Telecommunications
Jiapeng Li
Beijing University of Posts and Telecommunications
Muxi Diao
Beijing University of Posts and Telecommunications
Yinuo Jing
Beijing University of Posts and Telecommunications
Kongming Liang
Beijing University of Posts and Telecommunications
Computer Vision · Pattern Recognition · Machine Learning