🤖 AI Summary
This work investigates performance disparities of large language models (LLMs) in sentiment analysis across two varieties of Chinese: Taiwan Mandarin and Mainland Mandarin. We propose a novel context-aligned evaluation paradigm grounded in authentic user reviews, constructing a low-cost benchmark with high ecological validity by crawling and cleaning hotel reviews in both varieties from platforms such as Booking.com. Under a standardized sentiment classification task, we evaluate six prominent LLMs (e.g., the GPT, Qwen, and Llama series) and find a statistically significant 8.2% average accuracy drop on Taiwan Mandarin, indicating a systematic performance gap for varieties that are underrepresented in training data. The study brings real-world contextual alignment into the assessment of fairness across language varieties, providing a reproducible benchmark and a methodology for evaluating the linguistic equity of LLMs.
📝 Abstract
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmarking model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, such as reviews of the same hotel with the same rating, written in the same language (e.g., Mandarin Chinese) but in different varieties (e.g., Taiwan Mandarin, Mainland Mandarin). As a proof of concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs on a sentiment analysis task. Our results show that LLMs consistently underperform on Taiwan Mandarin.
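The contextual alignment described above (same hotel, same rating, different variety) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Review` fields, the variety codes `"TW"` and `"CN"`, the truncation used to balance the two varieties, and the `predict` callable standing in for an LLM are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    # All field names are hypothetical; the source does not specify a schema.
    hotel_id: str
    rating: int    # review score as shown on the platform (assumed integer)
    variety: str   # assumed codes: "TW" (Taiwan) or "CN" (Mainland)
    text: str
    label: str     # gold sentiment label, e.g. "pos" / "neg"


def align_reviews(reviews, varieties=("TW", "CN")):
    """Keep only (hotel, rating) contexts that contain every variety.

    Within each kept context, every variety is truncated to the same
    number of reviews so the comparison is balanced (an assumption,
    not a step confirmed by the source).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for r in reviews:
        buckets[(r.hotel_id, r.rating)][r.variety].append(r)

    aligned = []
    for by_variety in buckets.values():
        if all(by_variety.get(v) for v in varieties):
            n = min(len(by_variety[v]) for v in varieties)
            for v in varieties:
                aligned.extend(by_variety[v][:n])
    return aligned


def accuracy_by_variety(reviews, predict):
    """Compute sentiment accuracy per variety.

    `predict` is any callable mapping review text to a label; in a real
    evaluation it would wrap a call to the model under test.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in reviews:
        total[r.variety] += 1
        correct[r.variety] += int(predict(r.text) == r.label)
    return {v: correct[v] / total[v] for v in total}
```

Running `accuracy_by_variety(align_reviews(reviews), predict)` once per model and comparing the `"TW"` and `"CN"` entries gives the kind of per-variety gap the abstract reports, with hotel and rating held constant across the two groups.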