Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

📅 2026-02-11

🤖 AI Summary

This work addresses the limitations of existing multilingual reasoning benchmarks, which either keep English-centric scenarios or entangle the reasoning required with cultural content. To disentangle these dimensions, the authors use 100 language-agnostic templates, filled in by native annotators across 20 languages and cultural contexts, to produce multiple-choice questions and systematically derived True/False questions. The resulting dataset spans 7 reasoning types and 22 cultural aspects, factorizing reasoning from culture while keeping the English and local-language versions scenario-aligned, including for low-resource languages. The benchmark comprises 11,862 instances. In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, whereas open-weight models degrade substantially in local languages and often approach chance on True/False tasks. Culture-grounded mathematical and counting templates are consistently the hardest.
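The English versus local-language comparison can be made concrete with a small sketch. It assumes a hypothetical record format of `(question_language, is_correct)` pairs and uses toy values invented purely for illustration; none of the names or numbers come from the paper. The only fixed quantity is the 0.5 guessing baseline for two-way True/False questions.

```python
from collections import defaultdict

TF_CHANCE = 0.5  # guessing baseline for two-way True/False questions


def accuracy_by_language(records):
    """Return accuracy per question language.

    records: iterable of (question_language, is_correct) pairs
    (a hypothetical format, not the benchmark's actual schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, is_correct in records:
        total[language] += 1
        correct[language] += int(is_correct)
    return {language: correct[language] / total[language] for language in total}


# Toy predictions, invented for illustration only (not results from the paper).
toy_records = [
    ("english", True), ("english", True), ("english", True), ("english", False),
    ("local", True), ("local", False), ("local", True), ("local", False),
]

scores = accuracy_by_language(toy_records)
gap = scores["english"] - scores["local"]            # English minus local accuracy
margin_over_chance = scores["local"] - TF_CHANCE     # 0.0 means chance-level
```

A gap near zero corresponds to the near-parity described for reasoning-mode models, while a margin over chance near zero corresponds to the chance-level True/False behavior described for open-weight models.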

📝 Abstract

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
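The template-first idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Template` and `MCQ` classes, the slot names, the example scenario, and in particular the rule for deriving True/False items (one statement per answer option, true only for the correct option). The abstract does not specify the authors' actual derivation procedure or data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    # Hypothetical fields: the source only says templates are language-agnostic
    # and associated with a reasoning type and a cultural aspect.
    text: str
    reasoning_type: str
    cultural_aspect: str


@dataclass(frozen=True)
class MCQ:
    question: str
    options: tuple[str, ...]
    answer_index: int


def fill_template(template, slots, options, answer_index):
    """Fill the template's named slots with annotator-provided values."""
    return MCQ(template.text.format(**slots), tuple(options), answer_index)


def derive_true_false(mcq):
    """Assumed derivation: one statement per option, True only for the key."""
    return [
        (f"{mcq.question} Answer: {option}", index == mcq.answer_index)
        for index, option in enumerate(mcq.options)
    ]


# Invented example scenario; not taken from the dataset.
template = Template(
    text="During {festival}, a family buys {n} {item} for each of {k} guests. "
         "How many {item} do they buy in total?",
    reasoning_type="mathematical",
    cultural_aspect="festivals",
)
mcq = fill_template(
    template,
    slots={"festival": "Eid", "n": "3", "item": "pastries", "k": "4"},
    options=("7", "12", "34", "1"),
    answer_index=1,
)
statements = derive_true_false(mcq)
```

Because the reasoning type and cultural aspect live on the template rather than on the filled question, the same template can be filled by annotators in different countries while the required reasoning stays fixed, which is the factorization the abstract describes.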

Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning

cultural grounding

benchmark

reasoning evaluation

language bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

template-first benchmark

multilingual reasoning

cultural grounding