On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the generalization capability and fairness of large language models (LLMs) across culturally diverse measurement systems (e.g., currencies, units). Addressing three core research questions—(RQ1) which measurement system LLMs prefer by default, (RQ2) whether answers and accuracy vary across systems, and (RQ3) whether reasoning mitigates bias against non-dominant systems—the authors compile benchmark datasets of cross-cultural measurement scenarios and evaluate seven open-source LLMs on them, systematically testing chain-of-thought (CoT) and other reasoning-based prompting techniques. The work is the first to identify "measurement system representation bias" as a source of latent computational inequity. Empirical results reveal a significant preference for dominant systems across all models, with substantially lower accuracy on non-dominant systems. CoT improves accuracy by up to 37% but increases average response length by 2.1×, exposing a cultural dimension to the cost of reasoning.

📝 Abstract
Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs' answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.
Problem

Research questions and friction points this paper is trying to address.

LLMs default to dominant measurement systems, neglecting underrepresented cultures
LLMs show performance instability across different measurement systems
Mitigating bias requires costly reasoning methods, increasing test-time compute
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing LLMs across diverse measurement systems
Using chain-of-thought to improve accuracy
Analyzing measurement system bias in LLMs
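The evaluation idea behind these contributions—scoring a model's answers separately per measurement system while also tracking how much longer CoT responses are—can be sketched as follows. This is an illustrative sketch, not the paper's code: `toy_model`, the prompts, and the tolerance threshold are hypothetical stand-ins for a real LLM call and the paper's benchmark.

```python
# Illustrative sketch (not the paper's pipeline): score conversion answers
# and measure response length, since CoT gains come with longer outputs.

def toy_model(question: str, cot: bool) -> str:
    # Hypothetical stand-in: a real evaluation would query an LLM here.
    if cot:
        return "Step 1: 10 km * 0.621371 = 6.21 miles. Answer: 6.21"
    return "Answer: 6.2"

def extract_number(text: str) -> float:
    # Take the last numeric token in the reply as the final answer.
    tokens = [t.rstrip(".") for t in text.split()]
    nums = [t for t in tokens if t.replace(".", "", 1).lstrip("-").isdigit()]
    return float(nums[-1])

def evaluate(items, cot: bool, rel_tol: float = 0.05):
    """Return (accuracy, mean response length in whitespace tokens)."""
    correct, total_len = 0, 0
    for question, gold in items:
        reply = toy_model(question, cot=cot)
        total_len += len(reply.split())
        if abs(extract_number(reply) - gold) <= rel_tol * abs(gold):
            correct += 1
    return correct / len(items), total_len / len(items)

items = [("Convert 10 km to miles.", 6.21371)]
acc_direct, len_direct = evaluate(items, cot=False)
acc_cot, len_cot = evaluate(items, cot=True)
```

Running the same `evaluate` loop once per measurement system (metric vs. imperial question variants) would yield the per-system accuracy gaps the paper reports, while `len_cot / len_direct` captures the added test-time compute.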
Minh Duc Bui
Ph.D. Student, Johannes Gutenberg University Mainz
Natural Language Processing · NLP · Machine Learning
Kyung Eun Park
University of Mannheim, Germany
Goran Glavaš
Center For Artificial Intelligence and Data Science, University of Würzburg, Germany
Fabian David Schmidt
Member of Technical Staff, Cohere
Natural Language Processing
Katharina von der Wense
University of Colorado Boulder, USA