🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic robustness evaluation under manipulated video content. Method: This work introduces Ro-Bench, the first dynamic out-of-distribution (OOD) counterfactual video benchmark for evaluating MLLM robustness. Leveraging text-driven counterfactual video generation, it synthesizes high-quality, diverse videos by editing style, objects, backgrounds, and their compositions, and constructs a test set via a hybrid pipeline of human annotation and automated generation. Contribution/Results: Experiments on eight state-of-the-art video MLLMs reveal substantial performance degradation under counterfactual scenarios. Fine-tuning on counterfactual data improves accuracy on Ro-Bench by 21.73% and yields an average 12.78% gain across 20 tasks on MVBench. This work provides the first systematic characterization of robustness bottlenecks in MLLM-based video understanding and establishes a scalable framework for both evaluation and robustness enhancement.
📝 Abstract
Recently, Multi-modal Large Language Models (MLLMs) have demonstrated strong performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench provides high-quality, diverse, and temporally relevant video data by editing Style, Object, Background, and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs on counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and an average 12.78% improvement across 20 tasks on the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.