GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor performance of existing large language models on multi-step mathematical reasoning in low-resource languages such as Bengali, where models often fall back on English-based reasoning followed by translation, and where reinforcement learning struggles to converge under sparse reward conditions. To tackle these challenges, the authors introduce Ganit, the first Bengali mathematical reasoning dataset annotated with automatically assigned difficulty labels, and propose Curriculum-GRPO, a training framework that combines supervised fine-tuning with Group Relative Policy Optimization. The framework incorporates pass@k-based difficulty estimation, a verifiable reward mechanism, and difficulty-aware curriculum sampling. Experiments show that the resulting model, GanitLLM-4B, achieves accuracy gains of 8.0 and 7.0 percentage points on Bn-MGSM and Bn-MSVAMP, respectively, increases the proportion of Bengali reasoning tokens from 14% to over 88%, and reduces average solution length from 943 to 193 words.
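
The pass@k-based difficulty estimation can be pictured with a minimal sketch like the one below. `sample_solution` and `check_answer` are hypothetical stand-ins for the evaluator model's generation call and a numeric-answer verifier; the value of k and the bucket thresholds are illustrative assumptions, not the paper's settings.

```python
from typing import Callable


def estimate_difficulty(problem: str,
                        gold_answer: str,
                        sample_solution: Callable[[str], str],
                        check_answer: Callable[[str, str], bool],
                        k: int = 8) -> str:
    """Bucket a problem by the evaluator model's empirical pass@k.

    Both callables are hypothetical placeholders: `sample_solution`
    draws one solution from a strong evaluator model, and
    `check_answer` verifies its final numeric answer.
    """
    # Fraction of k sampled solutions whose final answer is correct.
    correct = sum(check_answer(sample_solution(problem), gold_answer)
                  for _ in range(k))
    pass_at_k = correct / k

    # Assumed thresholds for turning pass@k into a difficulty tag.
    if pass_at_k >= 0.75:
        return "easy"
    if pass_at_k >= 0.25:
        return "medium"
    return "hard"
```

These tags can then drive difficulty-aware curriculum sampling, e.g. weighting easier problems early in GRPO training and shifting probability mass toward harder ones as training progresses.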

📝 Abstract
We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
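
A minimal sketch of how the three verifiable reward components named in the abstract (format, numerical correctness, Bengali reasoning) might be combined into a single scalar reward. The `<answer>` tag format, the component weights, and the Bengali Unicode-range heuristic are illustrative assumptions, not the paper's exact reward function.

```python
import re

# Bengali Unicode block (U+0980 to U+09FF); heuristic for detecting Bengali text.
BENGALI_CHAR = re.compile(r"[\u0980-\u09FF]")


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Toy composite reward: format + numeric correctness + Bengali share.

    Assumed recipe: the model must place its final answer inside an
    <answer>...</answer> tag; correctness is exact match on that span;
    the language term is the fraction of non-whitespace characters
    falling in the Bengali Unicode block.
    """
    # Format reward: final answer appears inside an <answer> tag.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_ok = 1.0 if m else 0.0

    # Correctness reward: extracted answer matches the gold answer.
    predicted = m.group(1).strip() if m else ""
    correct = 1.0 if predicted == gold_answer.strip() else 0.0

    # Language reward: share of Bengali characters in the reasoning trace.
    chars = [c for c in completion if not c.isspace()]
    bn_share = sum(bool(BENGALI_CHAR.match(c)) for c in chars) / max(len(chars), 1)

    # Assumed weights; the paper's weighting may differ.
    return 0.2 * format_ok + 0.6 * correct + 0.2 * bn_share
```

In a GRPO-style setup, this scalar would be computed per sampled completion within a group and normalized into relative advantages before the policy update.
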
Problem

Research questions and friction points this paper is trying to address.

Bengali mathematical reasoning
low-resource languages
reward sparsity
multi-step math
language-specific reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum-GRPO
difficulty-aware sampling
Bengali mathematical reasoning
verifiable rewards
low-resource LLM
Shubhashis Roy Dipta
University of Maryland, Baltimore County
Natural Language Processing · Reasoning · Multimodal Understanding
Khairul Mahbub
University of North Carolina at Charlotte
Nadia Najjar
University of North Carolina at Charlotte