Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates performance disparities of large language models (LLMs) in sentiment analysis across Chinese language varieties: Taiwan Mandarin versus Mainland Mandarin. The authors propose the first context-aligned evaluation paradigm grounded in authentic user reviews, constructing a low-cost, high-ecological-validity benchmark by crawling and rigorously cleaning hotel reviews in both varieties from platforms such as Booking.com. Under a standardized sentiment-classification setup, they systematically evaluate six prominent LLMs (e.g., GPT, Qwen, and Llama series models) and find a statistically significant 8.2% average accuracy drop on Taiwan Mandarin, revealing a systematic performance gap for varieties underrepresented in training data. The study pioneers the integration of real-world contextual alignment into fairness assessment across language varieties, establishing a reproducible benchmark and methodology for evaluating multilingual capability and linguistic equity in LLMs.
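As a concrete illustration of the contextual-alignment step described above, the sketch below pairs reviews that share a context key (same hotel, same rating) across the two varieties. The record fields, the variety codes, and the pairing key are illustrative assumptions, not the paper's actual schema or crawling pipeline.

```python
from collections import defaultdict

# Hypothetical review records; the real dataset is crawled from Booking.com.
# Field names ("hotel", "rating", "variety", "text") are illustrative assumptions.
reviews = [
    {"hotel": "H001", "rating": 8, "variety": "tw", "text": "房間很乾淨，服務親切"},
    {"hotel": "H001", "rating": 8, "variety": "cn", "text": "房间很干净，服务很好"},
    {"hotel": "H002", "rating": 3, "variety": "tw", "text": "隔音很差，不推薦"},
]

def align_by_context(reviews):
    """Group reviews by a shared-context key (hotel, rating) and keep only
    contexts that have at least one review in each language variety."""
    buckets = defaultdict(lambda: {"tw": [], "cn": []})
    for r in reviews:
        buckets[(r["hotel"], r["rating"])][r["variety"]].append(r["text"])
    return {
        key: group for key, group in buckets.items()
        if group["tw"] and group["cn"]  # a contextually aligned pair exists
    }

aligned = align_by_context(reviews)
print(f"{len(aligned)} aligned (hotel, rating) contexts")
```

Aligning on shared real-world context in this way is what lets differences in model accuracy be attributed to the language variety rather than to differences in topic or sentiment distribution.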

📝 Abstract
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, such as reviews for the same hotel with the same rating, written in the same language (e.g., Mandarin Chinese) but in different varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
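To make the evaluation setup concrete, here is a minimal sketch of how per-variety accuracy on the sentiment task could be computed and compared. The rating-to-label threshold and the `classify_sentiment` stub are assumptions standing in for the paper's actual protocol and LLM calls.

```python
def rating_to_label(rating: int) -> str:
    """Derive a sentiment label from the review's numeric rating.
    The threshold here is an assumption, not the paper's protocol."""
    return "positive" if rating >= 7 else "negative"

def classify_sentiment(model, text: str) -> str:
    """Placeholder for an actual LLM call (e.g., via an API client).
    A real implementation would prompt the model and parse its answer."""
    raise NotImplementedError

def accuracy_by_variety(model, dataset):
    """dataset: list of dicts with 'text', 'rating', and 'variety' keys."""
    correct, total = {}, {}
    for ex in dataset:
        v = ex["variety"]
        pred = classify_sentiment(model, ex["text"])
        gold = rating_to_label(ex["rating"])
        total[v] = total.get(v, 0) + 1
        correct[v] = correct.get(v, 0) + (pred == gold)
    return {v: correct[v] / total[v] for v in total}

# For one model, acc["cn"] - acc["tw"] quantifies the per-variety gap.
```

Repeating this computation for each of the evaluated LLMs yields the per-model gaps that the paper aggregates into its reported average disparity.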
Problem

Research questions and friction points this paper is trying to address.

Measuring LLMs' performance disparities across different language varieties (one plausible paired check of such a gap is sketched below)
Using contextually aligned online reviews as a cost-effective evaluation source
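The summary above reports a statistically significant accuracy gap, but the exact statistical test is not stated here; the paired bootstrap below, run over the contextually aligned review pairs, is only one plausible way to check whether such a gap is robust, not the paper's method.

```python
import random

def paired_bootstrap_gap(correct_cn, correct_tw, n_boot=10_000, seed=0):
    """correct_cn / correct_tw: parallel lists of 0/1 correctness over the
    contextually aligned review pairs (same hotel, same rating).
    Returns the observed accuracy gap and the fraction of bootstrap
    resamples in which the gap is <= 0 (a rough one-sided check)."""
    rng = random.Random(seed)
    n = len(correct_cn)
    observed = sum(correct_cn) / n - sum(correct_tw) / n
    non_positive = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gap = (sum(correct_cn[i] for i in idx) - sum(correct_tw[i] for i in idx)) / n
        non_positive += gap <= 0
    return observed, non_positive / n_boot
```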
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextually aligned online reviews
Cost-effective benchmarking approach
Sentiment analysis across language varieties