Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mainstream mathematical reasoning benchmarks such as GSM8K exhibit strong Western cultural biases—evident in names, currencies, and everyday scenarios—potentially compromising the cross-cultural validity of LLM evaluations. Method: To address this, we systematically construct culturally adapted versions of GSM8K covering five non-Western regions—Africa, India, China, South Korea, and Japan—via prompt-driven entity-scenario substitution and rigorous human validation. Contribution/Results: We evaluate six open-weight LLMs spanning 8B–72B parameters on these variants. All models exhibit consistent performance degradation on non-Western versions, yet those with stronger inherent reasoning capabilities demonstrate significantly higher cultural robustness. This work introduces the first multi-regional, culturally diversified benchmark for mathematical reasoning and provides empirical evidence that cultural alignment critically impacts LLM mathematical reasoning fidelity—offering both a new evaluation standard and actionable insights for developing culturally resilient AI systems.

📝 Abstract
Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.
Problem

Research questions and friction points this paper is trying to address.

Assessing cultural bias in math problem presentation
Evaluating LLMs on culturally adapted math benchmarks
Exploring reasoning's role in mitigating cultural gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Culturally adapted GSM8K test sets
Prompt-based transformations with manual verification
Evaluated LLMs across cultural variations
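The evaluation summarized above reduces to comparing accuracy on the original and culturally adapted test sets. A minimal sketch, assuming each evaluated item records the model's predicted answer and the gold answer (this data layout is an assumption, not the paper's):

```python
# Minimal sketch of the accuracy-gap comparison between the original
# US-centric set and a culturally adapted variant. The item layout
# (dicts with "pred"/"gold") is an illustrative assumption.

def accuracy(items):
    """Fraction of items where the predicted answer matches the gold answer."""
    return sum(it["pred"] == it["gold"] for it in items) / len(items)

def cultural_gap(original, adapted):
    """Accuracy drop going from the original set to the adapted variant;
    a positive value means the model does worse on the adapted set."""
    return accuracy(original) - accuracy(adapted)

# Toy numbers for illustration only.
us = [{"pred": 72, "gold": 72}, {"pred": 10, "gold": 10}, {"pred": 5, "gold": 8}]
india = [{"pred": 72, "gold": 72}, {"pred": 9, "gold": 10}, {"pred": 5, "gold": 8}]
print(f"gap: {cultural_gap(us, india):+.3f}")
```

The paper reports this gap per region and per prompting strategy, and finds it shrinks for models with stronger reasoning capabilities.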
Authors
- Aditya Tomar (IIT Bombay)
- Nihar Ranjan Sahoo (IIT Bombay)
- Ashish Mittal (IBM Research AI)
- Rudra Murthy (Staff Research Scientist, IBM)
- Pushpak Bhattacharyya (IIT Bombay)