ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering

πŸ“… 2025-11-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing Chinese video question answering (VQA) benchmarks lack cultural sensitivity and linguistic adaptability, limiting their ability to accurately evaluate multimodal large language models' (MLLMs) comprehension of complex Chinese video content. Method: We introduce ChineseVideoBench, the first comprehensive Chinese video QA benchmark, covering eight main classes and twelve fine-grained sub-classes. It integrates deep video semantic parsing with nuanced understanding of Chinese language and cultural context. We design a culture-aware, fine-grained annotation scheme and task-specific evaluation metrics, and release a high-quality test set. Contribution/Results: Experiments reveal persistent challenges in Chinese video understanding: Gemini 2.5 Pro achieves the highest overall score (77.9%), while InternVL-38B is the top-performing open-source model. ChineseVideoBench fills a critical gap in Chinese multimodal evaluation, providing a rigorous, culturally grounded benchmark to guide model development and cross-cultural adaptation research.

πŸ“ Abstract
This paper introduces ChineseVideoBench, a pioneering benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering. The growing demand for sophisticated video analysis capabilities highlights the critical need for comprehensive, culturally-aware evaluation frameworks. ChineseVideoBench addresses this gap by providing a robust dataset and tailored evaluation metrics, enabling rigorous assessment of state-of-the-art MLLMs on complex Chinese video content. Specifically, ChineseVideoBench comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness. Our empirical evaluations reveal that ChineseVideoBench presents a significant challenge to current MLLMs. Among the models assessed, Gemini 2.5 Pro achieves the highest performance with an overall score of 77.9%, while InternVL-38B emerges as the most competitive open-source model.
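The abstract reports a single overall score per model (e.g. 77.9% for Gemini 2.5 Pro) aggregated over the benchmark's 8 main classes. The paper summary does not specify the exact aggregation, so the sketch below illustrates one plausible scheme, macro-averaging per-class accuracy; the class names and the averaging choice are assumptions for illustration, not ChineseVideoBench's documented metric.

```python
# Hypothetical sketch: aggregating per-class accuracy into an overall score.
# The class names and the unweighted (macro) average are illustrative
# assumptions, not the paper's exact evaluation protocol.
from collections import defaultdict

def overall_score(results):
    """results: list of (main_class, is_correct) pairs.
    Returns (per-class accuracy dict, macro-averaged score), in percent."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for cls, ok in results:
        total[cls] += 1
        correct[cls] += int(ok)
    per_class = {c: 100.0 * correct[c] / total[c] for c in total}
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro

# Toy run with two assumed class names standing in for the 8 main classes:
demo = [("cultural_context", True), ("cultural_context", False),
        ("temporal_reasoning", True), ("temporal_reasoning", True)]
per_class, macro = overall_score(demo)
# per_class -> {"cultural_context": 50.0, "temporal_reasoning": 100.0}; macro -> 75.0
```

A macro average weights each class equally regardless of its question count; a micro average (pooling all questions) would instead favor the larger classes.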
Problem

Research questions and friction points this paper is trying to address.

Benchmarking MLLMs for Chinese video question answering
Evaluating culturally-aware video understanding in Chinese context
Assessing multimodal models on complex Chinese linguistic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive benchmark for Chinese video question answering (8 main classes, 12 sub-classes)
Culture-aware annotation scheme with task-specific evaluation metrics
Rigorous evaluation of state-of-the-art MLLMs on complex Chinese video content