AI Summary
Large language models (LLMs) exhibit weak generalization in mathematical reasoning, and conventional token-level autoregressive modeling fails to capture the structured, stepwise nature of human mathematical problem-solving.
Method: We propose ClozeMath, a novel equation-level cloze fine-tuning paradigm that directly predicts masked key equations within solution derivations, aligning more closely with human-like structured reasoning. Our approach integrates LLM-based text-infilling fine-tuning, chain-of-thought (CoT) prompting, and beam search decoding, complemented by a systematic ablation framework.
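To make the cloze formulation concrete, here is a minimal sketch of how equation-level masking might construct infilling training pairs from a solution. The equation pattern follows GSM8K-style `<<...>>` calculator annotations; the sentinel token and helper names are illustrative assumptions, not the authors' implementation.

```python
import re

# GSM8K-style equation annotations, e.g. <<3*4=12>> (pattern is an assumption)
EQUATION_PATTERN = re.compile(r"<<[^>]+>>")
MASK_TOKEN = "<EQ_MASK>"  # hypothetical sentinel for the infilling target

def make_cloze_examples(solution: str) -> list[tuple[str, str]]:
    """Yield (masked_solution, target_equation) pairs, one per equation.

    Each training example masks a single equation while leaving the rest
    of the derivation visible, so the model learns to infill the missing
    step rather than predict the next token.
    """
    examples = []
    for eq in EQUATION_PATTERN.findall(solution):
        masked = solution.replace(eq, MASK_TOKEN, 1)
        examples.append((masked, eq))
    return examples

solution = "Tom has 3 bags of 4 apples, so <<3*4=12>> apples in total."
examples = make_cloze_examples(solution)
```

A fine-tuning loop would then feed the masked solution as input and the held-out equation as the target, analogous to a cloze exercise.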
Contribution/Results: Evaluated on the GSM8K, MATH, and GSM-Symbolic benchmarks, ClozeMath consistently outperforms the Masked Thought baseline across all metrics. The results demonstrate that equation-level modeling significantly improves both accuracy and robustness in mathematical reasoning. This work establishes a new training paradigm for enhancing LLMs' mathematical capabilities through structural, derivation-aware supervision.
Abstract
The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to the cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in both performance and robustness under two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.