ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering

πŸ“… 2025-11-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing Chinese video question answering (VQA) benchmarks lack cultural sensitivity and linguistic adaptability, limiting their ability to accurately evaluate multimodal large language models' (MLLMs) comprehension of complex Chinese video content. Method: We introduce ChineseVideoBench, the first comprehensive Chinese video QA benchmark, covering eight main classes and twelve fine-grained sub-classes. It integrates deep video semantic parsing with nuanced understanding of Chinese language and cultural context. We design a culture-aware, fine-grained annotation scheme and task-specific evaluation metrics, and release a high-quality test set. Contribution/Results: Experiments reveal persistent challenges in Chinese video understanding: Gemini 2.5 Pro achieves the highest overall score (77.9%), while InternVL-38B is the top-performing open-source model. ChineseVideoBench fills a critical gap in Chinese multimodal evaluation, providing a rigorous, culturally grounded benchmark to guide model development and cross-cultural adaptation research.

πŸ“ Abstract
This paper introduces ChineseVideoBench, a pioneering benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering. The growing demand for sophisticated video analysis capabilities highlights the critical need for comprehensive, culturally-aware evaluation frameworks. ChineseVideoBench addresses this gap by providing a robust dataset and tailored evaluation metrics, enabling rigorous assessment of state-of-the-art MLLMs on complex Chinese video content. Specifically, ChineseVideoBench comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness. Our empirical evaluations reveal that ChineseVideoBench presents a significant challenge to current MLLMs. Among the models assessed, Gemini 2.5 Pro achieves the highest performance with an overall score of 77.9%, while InternVL-38B emerges as the most competitive open-source model.
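The abstract reports a single overall score per model (e.g. 77.9% for Gemini 2.5 Pro) aggregated over the benchmark's 8 main classes. The paper summary does not specify the exact aggregation, so the sketch below illustrates one plausible scheme, macro-averaging per-class accuracy; the class names and the averaging choice are assumptions for illustration, not ChineseVideoBench's documented metric.

```python
# Hypothetical sketch: aggregating per-class accuracy into an overall score.
# The class names and the unweighted (macro) average are illustrative
# assumptions, not the paper's exact evaluation protocol.
from collections import defaultdict

def overall_score(results):
    """results: list of (main_class, is_correct) pairs.
    Returns (per-class accuracy dict, macro-averaged score), in percent."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for cls, ok in results:
        total[cls] += 1
        correct[cls] += int(ok)
    per_class = {c: 100.0 * correct[c] / total[c] for c in total}
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro

# Toy run with two assumed class names standing in for the 8 main classes:
demo = [("cultural_context", True), ("cultural_context", False),
        ("temporal_reasoning", True), ("temporal_reasoning", True)]
per_class, macro = overall_score(demo)
# per_class -> {"cultural_context": 50.0, "temporal_reasoning": 100.0}; macro -> 75.0
```

A macro average weights each class equally regardless of its question count; a micro average (pooling all questions) would instead favor the larger classes.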
Problem

Research questions and friction points this paper is trying to address.

Benchmarking MLLMs for Chinese video question answering
Evaluating culturally-aware video understanding in Chinese context
Assessing multimodal models on complex Chinese linguistic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive benchmark for Chinese video question answering (8 main classes, 12 sub-classes)
Culture-aware annotation scheme with task-specific evaluation metrics
Rigorous evaluation of state-of-the-art MLLMs on complex Chinese video content