🤖 AI Summary
Problem: Existing mathematical reasoning benchmarks are heavily English-centric and neglect low-resource languages. This work addresses the gap in understanding formal and informal mathematical text in Romanian, a morphologically rich, low-resource language on which current models perform poorly.
Method: We introduce RoMath, the first specialized, multi-level evaluation benchmark for Romanian mathematical reasoning, comprising three datasets: national high-school graduation exams (RoMath-Baccalaureate), mathematics competitions (RoMath-Competitions), and controllably synthesized problems (RoMath-Synthetic). RoMath is grounded in Romania’s educational curriculum and linguistic morphology rather than in naive machine translation of English problems. Data construction combines expert-authored items with rule-guided synthetic generation to ensure high-quality, linguistically accurate annotations.
Contribution/Results: An evaluation of several open-weight LLMs reveals substantial performance deficits in Romanian mathematical reasoning. All code and datasets are publicly released, establishing a reproducible, localization-aware assessment infrastructure that challenges the English-centric paradigm and advances multilingual mathematical AI.
📝 Abstract
Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets: RoMath-Baccalaureate, RoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. We make the code and dataset available.