TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant deficiencies in understanding non-Western cultural contexts—particularly China’s intangible cultural heritage—due to Western-centric training data and evaluation benchmarks. Method: We introduce TCC-Bench, the first bilingual (Chinese–English) visual question answering benchmark explicitly designed for traditional Chinese culture. It covers diverse domains—including artifacts, folk customs, and domestic animation—and pioneers “implicit cultural concept questioning” to mitigate linguistic bias and data leakage. Annotation employs a semi-automatic pipeline integrating GPT-4o-assisted generation with expert human validation to ensure cultural fidelity and evaluation robustness. Contribution/Results: Comprehensive evaluation across 30+ state-of-the-art MLLMs reveals an average accuracy gap of 42.6% relative to human performance on cultural understanding tasks, exposing critical bottlenecks in culturally grounded multimodal reasoning. TCC-Bench establishes a standardized, reproducible infrastructure for developing and evaluating culture-adapted multimodal models.

📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has significantly enhanced the ability of artificial intelligence systems to understand and generate multimodal content. However, these models often exhibit limited effectiveness when applied to non-Western cultural contexts, which raises concerns about their wider applicability. To address this limitation, we propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench), a bilingual (i.e., Chinese and English) Visual Question Answering (VQA) benchmark specifically designed for assessing the understanding of traditional Chinese culture by MLLMs. TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage. The benchmark also avoids language bias by preventing direct disclosure of cultural concepts within question texts. Experimental evaluations across a wide range of MLLMs demonstrate that current models still face significant challenges when reasoning about culturally grounded visual content. The results highlight the need for further research in developing culturally inclusive and context-aware multimodal systems. The code and data can be found at: https://github.com/Morty-Xu/TCC-Bench.
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' understanding of traditional Chinese culture
Addressing limited effectiveness in non-Western cultural contexts
Developing culturally inclusive multimodal AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual VQA benchmark for Chinese culture
Semi-automated pipeline with GPT-4o and human curation
Culturally rich and visually diverse data sources
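The semi-automated pipeline above can be sketched in miniature. The snippet below is a hypothetical illustration, not the authors' implementation: a stub stands in for GPT-4o text-only generation (which, per the abstract, sees textual metadata rather than the image), followed by an automatic language-bias filter that rejects questions naming the cultural concept directly; surviving items would then go to human curators. All function and field names here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class CandidateItem:
    """One candidate VQA item tied to an image of a cultural artifact."""
    image_id: str
    concept: str        # the cultural concept the image depicts
    question: str
    options: list
    answer: str

def generate_candidates(image_id: str, concept: str) -> list:
    """Stand-in for GPT-4o text-only generation. A real pipeline would
    prompt the model with textual metadata about the image (never the
    image itself) and parse its output into structured items."""
    return [
        CandidateItem(image_id, concept,
                      question="What was the object shown in the image mainly used for?",
                      options=["Cooking", "Ritual ceremonies", "Farming", "Weaving"],
                      answer="Ritual ceremonies"),
        CandidateItem(image_id, concept,
                      question=f"Which dynasty does this {concept} date from?",  # leaks the concept
                      options=["Tang", "Song", "Ming", "Qing"],
                      answer="Ming"),
    ]

def leaks_concept(item: CandidateItem) -> bool:
    """Language-bias check: reject questions that name the concept directly,
    so the model must recognize it from the image alone."""
    return item.concept.lower() in item.question.lower()

def curate(candidates: list) -> list:
    """Drop leaking items; in practice the survivors would then be
    validated by human experts for cultural accuracy."""
    return [c for c in candidates if not leaks_concept(c)]

cands = generate_candidates("bronze_ding_001", "bronze ding")
kept = curate(cands)
print(len(cands), len(kept))  # 2 1
```

The key design point mirrored here is that the leakage check operates purely on question text, which is what lets the benchmark separate visual cultural recognition from linguistic pattern matching.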
👥 Authors
Pengju Xu
Beijing University of Posts and Telecommunications, Beijing, China
Yan Wang
Beijing University of Posts and Telecommunications, Beijing, China
Shuyuan Zhang
Beijing University of Posts and Telecommunications, Beijing, China
Xuan Zhou
Beijing University of Posts and Telecommunications, Beijing, China
Xin Li
Changchun University of Science and Technology, Beijing, China
Yue Yuan
Beijing University of Posts and Telecommunications, Beijing, China
Fengzhao Li
Beijing University of Posts and Telecommunications, Beijing, China
Shunyuan Zhou
North China University of Technology, Beijing, China
Xingyu Wang
Nanjing University of Posts and Telecommunications
Yi Zhang
Beijing University of Posts and Telecommunications, Beijing, China
Haiying Zhao
Beijing University of Posts and Telecommunications, Beijing, China