Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses critical trustworthiness risks in video large language models (Video-LLMs): factual inaccuracies, harmful content, bias, hallucinations, and privacy leakage. To this end, we propose the first five-dimensional trustworthiness evaluation framework, encompassing truthfulness, safety, robustness, fairness, and privacy. We introduce Trust-videoLLMs, a benchmark comprising 30 dynamic visual and cross-modal tasks, built on a spatiotemporally aware hybrid dataset of adapted, synthetic, and annotated videos. Methodologically, the framework combines multimodal prompt engineering, dynamic video sampling with perturbation injection, cross-modal consistency verification, and quantitative privacy risk analysis. A comprehensive evaluation of 23 state-of-the-art Video-LLMs reveals significant vulnerabilities under dynamic scenes and cross-modal perturbations. Finally, we open-source an extensible evaluation toolkit to advance standardization in trustworthy video AI.
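The summary above describes the evaluation methodology only at a high level. As a minimal sketch of what "dynamic video sampling and perturbation injection" for robustness probing could look like in practice, assuming frames are held as NumPy arrays, the snippet below pairs a clean sampled clip with a perturbed one; the function names (`sample_frames`, `add_gaussian_noise`, `drop_frames`) are illustrative placeholders, not the actual Trust-videoLLMs API.

```python
# Illustrative sketch of dynamic frame sampling plus perturbation injection
# for robustness probing; not the paper's published code.
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` frames from a (T, H, W, C) video tensor."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

def add_gaussian_noise(frames: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Visual perturbation: pixel-level Gaussian noise, clipped to valid range."""
    noisy = frames.astype(np.float32) + np.random.normal(0, sigma, frames.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def drop_frames(frames: np.ndarray, drop_ratio: float = 0.25) -> np.ndarray:
    """Temporal perturbation: randomly drop a fraction of the sampled frames."""
    keep = np.sort(np.random.choice(
        len(frames), size=int(len(frames) * (1 - drop_ratio)), replace=False))
    return frames[keep]

# Example: build a clean/perturbed pair and compare a model's answers on both.
video = np.random.randint(0, 256, size=(120, 224, 224, 3), dtype=np.uint8)
clean = sample_frames(video)
perturbed = drop_frames(add_gaussian_noise(clean))
```

A robustness score can then be defined as the consistency of a model's outputs between the clean and perturbed clips.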

📝 Abstract
Recent advancements in multimodal large language models for video understanding (videoLLMs) have improved their ability to process dynamic multimodal data. However, trustworthiness challenges (factual inaccuracies, harmful content, biases, hallucinations, and privacy risks) undermine reliability, given video data's spatiotemporal complexities. This study introduces Trust-videoLLMs, a comprehensive benchmark evaluating videoLLMs across five dimensions: truthfulness, safety, robustness, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses dynamic visual scenarios, cross-modal interactions, and real-world safety concerns. Our evaluation of 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) reveals significant limitations in dynamic visual scene understanding and in resilience to cross-modal perturbations. Open-source videoLLMs show occasional advantages on truthfulness but lower overall credibility than commercial models, with data diversity mattering more than scale. These findings highlight the need for advanced safety alignment to enhance both capability and trustworthiness. Trust-videoLLMs provides a publicly available, extensible toolbox for standardized trustworthiness assessment, bridging the gap between accuracy-focused benchmarks and the critical demands of robustness, safety, fairness, and privacy.
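To make the five-dimension scoring concrete, here is a hedged sketch of how per-task results might be rolled up into dimension-level scores for a leaderboard; the task names and numbers are placeholders for illustration, not results or identifiers from the paper.

```python
# Hypothetical aggregation of per-task scores into the five trustworthiness
# dimensions; placeholder task names and values, not the paper's results.
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("truthfulness", "safety", "robustness", "fairness", "privacy")

def aggregate(task_scores: dict[str, tuple[str, float]]) -> dict[str, float]:
    """Average per-task scores within each trustworthiness dimension."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for task, (dimension, score) in task_scores.items():
        assert dimension in DIMENSIONS, f"unknown dimension for task {task!r}"
        buckets[dimension].append(score)
    return {dim: mean(scores) for dim, scores in buckets.items()}

# Placeholder scores for two of the benchmark's 30 tasks.
report = aggregate({
    "event_hallucination_qa": ("truthfulness", 0.62),
    "noise_perturbed_captioning": ("robustness", 0.48),
})
print(report)  # e.g. {'truthfulness': 0.62, 'robustness': 0.48}
```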
Problem

Research questions and friction points this paper is trying to address.

Evaluating trustworthiness in video multimodal LLMs across dimensions
Assessing limitations in dynamic visual scene understanding
Addressing gaps in robustness, safety, fairness, and privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for videoLLM trustworthiness evaluation
Dynamic visual and cross-modal interaction assessment framework
Public toolbox for standardized trustworthiness assessments
👥 Authors
Youze Wang
Hefei University of Technology
Zijun Chen
Hefei University of Technology
Ruoyu Chen
Institute of Information Engineering, Chinese Academy of Sciences
Explainable AI · Trustworthy AI · Foundation Model
Shishen Gu
Hefei University of Technology
Yinpeng Dong
Tsinghua University
Machine Learning · Deep Learning · AI Safety
Hang Su
Tsinghua University
Jun Zhu
Tsinghua University
Meng Wang
Hefei University of Technology
Richang Hong
Hefei University of Technology
Multimedia · Pattern Recognition
Wenbo Hu
Hefei University of Technology