McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

📅 2025-07-02
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing bias evaluation datasets predominantly target the English language and North American cultural contexts, lacking fine-grained, multi-task benchmarks tailored to Chinese linguistic and cultural norms. To address this gap, we propose McBE, the first multi-task bias evaluation benchmark specifically designed for the Chinese language and culture. McBE spans 12 primary bias categories and 82 fine-grained subcategories, comprising 4,077 high-quality, human-constructed instances that support five distinct evaluation tasks. It introduces the first Chinese multi-dimensional joint evaluation framework, enabling both fine-grained bias classification and systematic quantification. Empirical evaluation reveals significant and pervasive systemic biases across major Chinese large language models, particularly along gender, regional, and occupational dimensions. McBE fills a critical void in bias assessment for non-English, non-Western cultural settings and establishes a reproducible, extensible evaluation infrastructure to advance fairness research for Chinese LLMs.
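
The summary describes McBE's composition (12 categories, 82 subcategories, 4,077 instances, 5 tasks) without showing a concrete data layout. The Python sketch below is one hypothetical way such instances could be represented; the field names, task labels, and example values are assumptions made for illustration, not the authors' released schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class BiasInstance:
    """One benchmark item; field names are illustrative assumptions."""
    category: str     # one of the 12 primary bias categories, e.g. gender or region
    subcategory: str  # one of the 82 fine-grained subcategories
    task: str         # which of the 5 evaluation tasks this instance serves
    text: str         # the human-constructed Chinese evaluation text

def category_coverage(instances: list[BiasInstance]) -> Counter:
    """Count instances per primary category to inspect dataset coverage."""
    return Counter(inst.category for inst in instances)

# Two toy instances (contents invented for illustration).
demo = [
    BiasInstance("gender", "occupational stereotype", "classification", "..."),
    BiasInstance("region", "regional stereotype", "scoring", "..."),
]
print(category_coverage(demo))  # Counter({'gender': 1, 'region': 1})
```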

📝 Abstract
As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being revealed. Measuring biases in LLMs is therefore crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present the Multi-task Chinese Bias Evaluation Benchmark (McBE), which includes 4,077 bias evaluation instances covering 12 bias categories and 82 subcategories, and introduces 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. All of these LLMs demonstrate varying degrees of bias. We conduct an in-depth analysis of the results, offering novel insights into bias in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Measure biases in Chinese-focused large language models
Address lack of multi-task bias evaluation datasets
Evaluate bias across diverse categories and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task Chinese Bias Evaluation Benchmark (McBE)
4,077 human-constructed instances covering 12 bias categories and 82 subcategories
Five evaluation tasks for comprehensive bias measurement (see the sketch below)
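
The benchmark's "systematic quantification" implies per-task bias scores that are combined into an overall measure. The following is a minimal sketch of one plausible aggregation, assuming a weighted mean over five task scores with higher meaning more biased; the task names and values are invented, and McBE's actual scoring formula is defined in the paper, not reproduced here.

```python
def aggregate_bias_score(task_scores, weights=None):
    """Weighted mean of per-task bias scores (higher = more biased, assumed)."""
    if weights is None:
        # Default: weight every task equally.
        weights = {task: 1.0 for task in task_scores}
    total = sum(weights[t] for t in task_scores)
    return sum(task_scores[t] * weights[t] for t in task_scores) / total

# Five hypothetical per-task scores for one model (names and values invented).
scores = {
    "task_1": 0.42,
    "task_2": 0.31,
    "task_3": 0.55,
    "task_4": 0.28,
    "task_5": 0.47,
}
print(round(aggregate_bias_score(scores), 3))  # 0.406 (unweighted mean)
```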
👥 Authors

Tian Lan
College of Computer Science, Inner Mongolia University, China

Xiangdong Su
Inner Mongolia University
speech processing · image processing · natural language processing

Xu Liu
College of Computer Science, Inner Mongolia University, China

Ruirui Wang
College of Computer Science, Inner Mongolia University, China

Ke Chang
College of Computer Science, Inner Mongolia University, China

Jiang Li
College of Computer Science, Inner Mongolia University, China

Guanglai Gao
Professor, College of Computer Science, Inner Mongolia University, China
Artificial Intelligence · Natural Language Processing · Speech Signal Processing · Computer Vision