Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

📅 2026-03-15
🤖 AI Summary
This study addresses a significant coverage gap in large language models when they handle closely related language varieties, such as Cantonese–Mandarin or Bavarian–German: regional knowledge exclusive to local Wikipedia editions remains largely inaccessible to the models, producing pronounced information asymmetry. The work presents the first systematic analysis of this deficit and introduces a challenging, human-annotated question-answering dataset to evaluate model performance. Combining contextual priming, machine translation, and a stratified evaluation framework, the research shows that without explicit local knowledge, models struggle with region-specific queries. Providing lead-section context substantially improves answer accuracy, with further gains from strategic translation, affirming the unique value of local Wikipedias as critical sources of regional knowledge.

📝 Abstract
Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. For local language varieties in particular, there are large asymmetries, e.g., information in a local Wikipedia edition that is absent from the standard variety's edition. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on local Wikipedia pages but absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information found only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations, together with stratified evaluations, reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about the inclusivity and cultural coverage of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Information Asymmetry
Language Varieties
Large Language Models
Question Answering
Local Knowledge