How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs

📅 2024-07-12
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study identifies a structural gap in China's large language model (LLM) governance regarding linguistic diversity: neither national language policy nor AI strategy regulates the internal language composition of LLMs, and technical documentation typically declares support for only Chinese and English. Method: We systematically evaluate the real-world multilingual performance of six open-source Chinese LLMs across 18 languages, integrating multilingual benchmarking, pretraining data language distribution analysis, technical report text mining, and cross-comparative analysis of domestic and international policies and model capabilities. Contribution/Results: We provide the first empirical evidence that Chinese LLMs achieve multilingual competence comparable to leading international models, yet their language selection remains unguided by policy and lacks transparency in disclosure. Our findings bridge a critical gap in AI governance between language policy and model practice, offering actionable evidence and policy levers for building responsible, inclusive multilingual AI ecosystems.
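The cross-comparative step described above can be illustrated with a minimal sketch: given per-language benchmark accuracies for a Chinese and an international model, compute the per-language gap and its mean. The model names and scores here are hypothetical placeholders, not results from the paper; real values would come from running each LLM on the multilingual benchmarks.

```python
from statistics import mean

# Hypothetical per-language accuracies (fraction correct) for two models.
# Language codes: zh = Mandarin Chinese, en = English, sw = Swahili, th = Thai.
scores = {
    "chinese-llm-a": {"zh": 0.82, "en": 0.80, "sw": 0.41, "th": 0.47},
    "intl-llm-b":    {"zh": 0.79, "en": 0.83, "sw": 0.43, "th": 0.45},
}

def per_language_gap(scores, model_a, model_b):
    """Absolute accuracy gap per language shared by both models."""
    shared = set(scores[model_a]) & set(scores[model_b])
    return {lang: abs(scores[model_a][lang] - scores[model_b][lang])
            for lang in sorted(shared)}

gaps = per_language_gap(scores, "chinese-llm-a", "intl-llm-b")
print(gaps)                 # small per-language gaps
print(mean(gaps.values()))  # mean gap across languages
```

A uniformly small mean gap across all 18 languages is the kind of evidence behind the claim that Chinese LLMs' multilingual performance is indistinguishable from international models.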

๐Ÿ“ Abstract
Contemporary language models are increasingly multilingual, but Chinese LLM developers must navigate complex political and business considerations of language diversity. Language policy in China aims at influencing public discourse and governing a multi-ethnic society, and has gradually transitioned from a pluralist to a more assimilationist approach since 1949. We explore the impact of these influences on current language technology. We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages, spanning a wide range of Chinese, Asian, and Anglo-European languages. Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs. Similarly, the models' technical reports show a lack of consideration for pretraining data language coverage beyond English and Mandarin Chinese. Examining Chinese AI policy, model experiments, and technical reports, we find no sign of any consistent policy, for or against language diversity, in China's LLM development. This leaves the puzzling fact that although China regulates both the languages people use daily and language model development, it does not seem to have any policy on the languages in language models.
Problem

Research questions and friction points this paper is trying to address.

Impact of China's language policy on LLM development
Lack of language diversity consideration in Chinese LLMs
No consistent policy on languages in Chinese LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates six open-source multilingual Chinese LLMs
Tests performance on 18 diverse languages
Examines lack of language diversity policy
Andrea Wen-Yi Wang
Cornell University
Unso Eun Seo Jo
Cornell University
Lu Jia Lin
Seoul National University
David Mimno
Associate Professor, Cornell University
Machine Learning · Text Mining · Topic Modeling · Digital Humanities