The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study presents the first systematic comparison of mainstream large language models from China and the United States in their comprehension of Chinese cultural contexts, focusing on traditional Chinese content such as history, literature, and classical poetry. Using a direct-questioning paradigm, the evaluation covers models including GPT-5.1, Gemini 2.5 Pro, DeepSeek-V3.2, and Qwen3-Max. The results show that Chinese-developed models significantly outperform their American counterparts overall, while among the U.S. models, Gemini 2.5 Pro and GPT-5.1 exhibit relatively stronger performance. The findings underscore the influence of training data distribution and localization strategies on models' cultural understanding, offering empirical evidence to inform cross-cultural alignment in artificial intelligence.

๐Ÿ“ Abstract
Cultural backgrounds shape individuals' perspectives and approaches to problem-solving. Since the emergence of GPT-1 in 2018, large language models (LLMs) have undergone rapid development. To date, the world's ten leading LLM developers are primarily based in China and the United States. To examine whether LLMs released by Chinese and U.S. developers exhibit cultural differences in Chinese-language settings, we evaluate their performance on questions about Chinese culture. This study adopts a direct-questioning paradigm to evaluate models such as GPT-5.1, DeepSeek-V3.2, Qwen3-Max, and Gemini 2.5 Pro. We assess their understanding of traditional Chinese culture, including history, literature, poetry, and related domains. Comparative analyses between LLMs developed in China and the U.S. indicate that Chinese models generally outperform their U.S. counterparts on these tasks. Among U.S.-developed models, Gemini 2.5 Pro and GPT-5.1 achieve relatively higher accuracy. The observed performance differences may arise from variations in training data distribution, localization strategies, and the degree of emphasis on Chinese cultural content during model development.
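The direct-questioning paradigm described above can be sketched as a small evaluation loop: pose each cultural question to a model, compare its answer to a gold reference, and report accuracy per model. The sketch below is a minimal illustration under stated assumptions, not the paper's actual harness; `ask_model` is a hypothetical stub standing in for a real model API call, and the single question-answer pair is an invented example.

```python
# Minimal sketch of a direct-questioning accuracy evaluation.
# Assumption: `ask_model` is a hypothetical stand-in for a real LLM API
# call; a real harness would query GPT-5.1, DeepSeek-V3.2, etc. here.

def ask_model(model_name: str, question: str) -> str:
    # Hypothetical stub: returns a canned answer so the sketch runs offline.
    canned = {"Who wrote the poem 'Quiet Night Thought' (静夜思)?": "李白"}
    return canned.get(question, "")

def evaluate(model_name: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Return the fraction of questions the model answers exactly right."""
    correct = sum(
        ask_model(model_name, question).strip() == gold.strip()
        for question, gold in qa_pairs
    )
    return correct / len(qa_pairs)

# Invented example item; the paper's question set covers history,
# literature, and classical poetry.
qa = [("Who wrote the poem 'Quiet Night Thought' (静夜思)?", "李白")]
accuracy = evaluate("DeepSeek-V3.2", qa)
```

A real study would also need answer normalization beyond exact string matching (e.g. handling synonyms or traditional/simplified character variants), which this sketch omits for brevity.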
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Chinese Culture
Cultural Differences
Model Evaluation
Cross-cultural AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Cultural Bias
Chinese Culture
Model Evaluation
Cross-cultural Comparison
Feiyan Liu
School of Physics and Information Technology, Shaanxi Normal University, Xi'an
Chenxun Zhuo
School of Foreign Language, Northwest University, Xi'an
Siyan Zhao
University of California Los Angeles
Large Language Models · Reinforcement Learning · Machine Learning
Bao Ge
School of Physics and Information Technology, Shaanxi Normal University, Xi'an
Tianming Liu
Distinguished Research Professor of Computer Science, University of Georgia
Brain · Brain-Inspired AI · LLM · Artificial General Intelligence · Quantum AI